AD-A257  833 



FINAL  TECHNICAL  REPORT 
for 

CONTRACT  NUMBER  N00014-88-k-0723 
and  project  entitled 


RECONFIGURATION  SCHEMES 
FOR 

FAULT-TOLERANT  PROCESSOR  ARRAYS 


Period:  88  Jul  01  to  92  Aug  31 

(includes  1  year  no-cost  extension) 

Scientific  Officer:  Dr.  Clifford  Lau 

ONR  Electronics  Division 


Prepared by: Dr. José A. B. Fortes

Purdue  University 


Date: 


92  Oct  15 


This document has been approved
for public release and sale; its
distribution is unlimited.




DISCLAIMER  NOTICE 


THIS  DOCUMENT  IS  BEST 
QUALITY  AVAILABLE.  THE  COPY 
FURNISHED  TO  DTIC  CONTAINED 
A  SIGNIFICANT  NUMBER  OF 
PAGES  WHICH  DO  NOT 
REPRODUCE  LEGIBLY. 




Dist  A.  per  telecon  Dr.  C.  Lau 
ONR/Code  1114SE 
Arlington,  VA  22217-5000 


11/20/92  CG 


Distribution:

Addressee                                          Number of Copies

Dr. Clifford Lau                                           1
Electronics Division
Office of Naval Research (Code 1114SE)
800 N. Quincy Street
Arlington, VA 22217-5000

Administrative Contracting Officer                         1
Office of Naval Research
Resident Representative
536 South Clark Street, Rm 286
Chicago, IL 60605-1588

Director, Naval Research Laboratory                        6
ATTN: Code 2627
Washington, D.C. 20375

Defense Technical Information Center                      12
Building 5, Cameron Station
Alexandria, VA 22314


Table  of  Contents 


1.  Introduction 

2.  Annotated  list  of  publications 

3.  Thesis  publications 

4.  Graduate  students  supported 

5.  Invited  lectures  by  Jose  Fortes 

6.  Honors  and  editorial  positions  of  Jose  Fortes 

7.  Reprints  of  publications 


1.  Introduction 


This project addressed several aspects of the problem of designing highly-reliable
dynamically  reconfigurable  processor  arrays.  The  proposed  work  focused  mainly  on 
reconfiguration  schemes  required  to  implement  fault-tolerant  processor  arrays. 
According  to  the  original  statement  of  work,  the  following  complementary  objectives 
were  pursued  (relevant  references  to  work  performed  under  the  grant  appear  in 
parentheses): 

1.  a  methodology  for  the  design  and  evaluation  of  processor-switched  arrays 
([2], [7], [8], [11], [12], [13], [24]),

2.  a  methodology  for  the  design  and  evaluation  of  multi-level  hierarchically 
reconfigurable  processor  arrays  ([10],[23],[25]), 

3.  a  methodology  for  the  design  of  fault-tolerant  interconnection  routers  for 
processor  arrays  with  decentralized  routing  control  ([5],  [17],  [19],  [20], 
[22],  [27])  and 

4.  algorithm  reconfiguration  strategies  which,  together  with  hardware 
reconfiguration  schemes,  can  be  used  to  achieve  graceful  degradation  in 
processor arrays ([1], [3], [4], [6], [14], [15], [16], [26]).

The  emphasis  of  the  proposed  research  was  on  the  development  of  optimal 
reconfiguration  schemes  for  each  of  the  above  objectives  by  using  mathematical  and 
simulation  tools.  For  this  purpose,  evaluation  methods  and  adequate  measures  were 
also  studied  and  developed.  These  measures  include  not  only  reliability  but  also  joint 
measures  of  performance,  hardware  area  and  reliability.  Software  tools  were 
developed  to  help  in  the  process  of  evaluating  the  reconfiguration  schemes  proposed. 
The  contributions  of  this  work  include  a  sound  mathematical  framework  for  the 
design  of  reliable  reconfigurable  processor  arrays,  tools  for  their  evaluation  and  novel 
hardware  designs  for  processor  arrays  and  their  components.  These  results  will  allow 
the  design  of  very  large  processor  array  systems  suited  for  applications  characterized 
by  the  need  to  survive  without  maintenance  in  hard-to-predict  harsh  environments 
and  long  mission  times.  These  areas  of  application  include,  among  others,  real-time 
computers,  digital  signal,  image  and  speech  processing  systems,  robotics,  aerospace 
and  transportation  vehicles  and  remote  data  processing  and  sensing  systems. 


2.  Annotated  list  of  publications 


The following publications report work funded in part by contract N00014-88-k-0723.
After each reference a short explanation of its contents follows. For those references
marked by an asterisk the full publication is included in Section 7 of this
report (Reprints of publications). The other references are to either earlier versions or
conference versions of those marked by asterisks (most of which are archival journal
papers).

[1]* Shang, W. and Fortes, J. A. B., "On the Optimality of Linear Schedules," Journal
of VLSI Signal Processing, Volume 1, Number 3, November 1989, pp. 209-220.

Note - This paper shows how to compare the optimal linear schedule for a single
output computed by a recurrence-like algorithm. These schedules and algorithms
capture the majority of systolic array algorithms. It is also shown that
optimal linear schedules are very close to the best possible schedules and, in
many cases, just as good. These results are relevant to the problem of rescheduling
systolic algorithms in arrays with faulty components.

[2]* Chean, M., and Fortes, J. A. B., "A Taxonomy of Reconfiguration Techniques
for Fault-Tolerant Processor Arrays," IEEE Computer, December 1990, pp. 55-69.

Note - This paper surveys and classifies processor array reconfiguration schemes
proposed by the authors and other researchers. Criteria for evaluation of the
schemes and their relative merits are also discussed.

[3]* Shang, W., and Fortes, J. A. B., "Time Optimal Linear Schedules for Algorithms
with Uniform Dependencies," IEEE Transactions on Computers, Volume 40,
Number 6, June 1991, pp. 723-743.

Note  -  This  paper  extends  the  results  of  reference  [1]  to  the  case  when  several 
outputs  must  be  computed  in  the  minimal  amount  of  time. 

[4]* Shang, W. and Fortes, J. A. B., "On the Independent Partitioning of Algorithms
with Uniform Data Dependencies," IEEE Transactions on Computers, Volume
41, Number 2, February 1992, pp. 190-206.

Note - This paper presents a technique for optimal partition of algorithms into
blocks of computations among which no communication is required. The same
techniques can be used for the case when limited communication is possible.
These techniques are potentially useful to reallocate computations in multiprocessor
systems where failures result in diminished connectivity among processors.

[5]* Rau, D., Fortes, J. A. B. and Siegel, H. J., "Destination Tag Routing Techniques
Based on a State Model for the IADM Network," IEEE Transactions on Computers,
Volume 41, Number 3, March 1992, pp. 274-286.

Note - This paper proposes a simple destination-tag routing mechanism for a
class of networks. It also shows how the message routing scheme supports the
rerouting of messages around faulty nodes and links. These results are of use in
fault-tolerant processor arrays where a multistage network is used to connect
processors.

[6]* Shang, W. and Fortes, J. A. B., "On Mapping Uniform Dependence Algorithms
into Lower Dimensional Processor Arrays," IEEE Transactions on Parallel and
Distributed Systems, Volume 3, Number 3, May 1992, pp. 350-363.

Note - Two-dimensional processors with faulty elements can be easily
reconfigured as one-dimensional arrays. However, it is necessary to remap the
original algorithm into an array of smaller dimension than that of the original
array. This paper presents a mapping methodology which addresses this problem.

[7]* Chean, M., and Fortes, J. A. B., "The Full-Use-of-Suitable-Spares (FUSS)
Approach to Hardware Reconfiguration for Fault-Tolerant Processor Arrays,"
IEEE Transactions on Computers, Vol. 39, No. 4, April 1990, pp. 564-571.

Note - This paper describes a new technique for processor array reconfiguration
and discusses its performance. It allows a processor array to reconfigure in the
presence of up to as many faults as the number of spare processors available
with a probability of more than 98%.

[8]* Lopez-Benitez, N. and Fortes, J. A. B., "Detailed Modeling and Reliability
Analysis of Fault-Tolerant Processor Arrays," IEEE Transactions on Computers,
Vol. 41, No. 9, September 1992, pp. 1193-1200.

Note  -  This  paper  presents  an  efficient  approach  to  the  problem  of  modeling 
and  evaluating  the  reliability  of  fault-tolerant  processor  arrays.  It  also  reports 
results  obtained  by  using  a  software  package  that  implements  the  proposed 
approach. 

[9] O'Keefe, M. T., Fortes, J. A. B. and Wah, B. W., "On the Relationship
Between Two Systolic Array Design Methodologies," to appear in IEEE
Transactions on Computers.

Note  -  This  paper  extends  earlier  work  by  the  same  authors  and  shows  how  the 
understanding  of  the  relations  between  two  design  methodologies  was  useful  in 
deriving  new  systolic  array  designs.  The  techniques  are  of  use  to  algorithm 
reconfiguration  as  well. 

[10]* Wang, Y-X. and Fortes, J. A. B., "On the Analysis and Design of Hierarchical
Fault-Tolerant Processor Arrays," International Workshop on Defect and
Fault-Tolerance in VLSI Systems, October 6-7, 1988, Springfield, Massachusetts.

Note  -  This  paper  addresses  the  problem  of  designing  hierarchically  organized 
fault-tolerant  processor  arrays  so  that  their  reliability  is  optimized. 

[11] Chean, M. and Fortes, J. A. B., "FUSS: A Reconfiguration Scheme for
Fault-Tolerant Processor Arrays," International Workshop on Hardware
Fault-Tolerance in Multiprocessors, June 19-20, 1989, Urbana, Illinois.

Note - This paper reports in "extended-abstract" form work that led to the results
in [7].

[12]  Lopez-Benitez,  N.  and  Fortes,  J.  A.  B.,  "Detailed  Modeling  of  Fault-Tolerant 
Processor  Arrays,"  19th  Int’l  Symposium  on  Fault-Tolerant  Computing,  June 
21-23,  1989,  Chicago,  Illinois,  pp.  545-552. 

Note  -  Extended  conference  version  of  [8]. 

[13]*  Wang,  Y.-X  and  Fortes,  J.A.B.,  "Estimates  of  MTTF  and  Optimal  Number  of 
Spares  of  Fault-Tolerant  Processor  Arrays,"  20th  Int’l  Symposium  on  Fault- 
Tolerant  Computing,  June  26-28,  1990,  Newcastle  Upon  Tyne,  U.K.,  pp.  184- 
191. 

Note  -  This  paper  presents  an  efficient  analytical  approach  to  the  evaluation  of 
Mean-Time-To-Failure  (MTTF)  of  fault-tolerant  arrays.  It  also  proposes  a 
design  procedure  that  optimizes  MTTF. 


[14] Shang, W., and Fortes, J. A. B., "Time-Optimal and Conflict-Free Mappings of
Uniform Dependence Algorithms into Lower Dimensional Processor Arrays,"
1990 International Conference on Parallel Processing, St. Charles, Illinois, Vol.
I, pp. 101-110.

Note  -  An  earlier  conference  version  of  [6]. 

[15]* Shang, W., O'Keefe, M. T., and Fortes, J. A. B., "On Loop Transformations for
Generalized Cycle Shrinking," 1991 International Conference on Parallel
Processing, August, St. Charles, Illinois, Vol. II, pp. 132-141. Improved version to
appear in IEEE Transactions on Parallel and Distributed Systems.

Note - This paper shows how certain compiler optimizations (generically
denoted as cycle shrinking transformations) can be unified and generalized by
linear schedules such as those used to schedule systolic algorithms.

[16] Shang, W., Yang, Z. and Fortes, J. A. B., "Conflict-Free Scheduling of Nested
Loop Algorithms on Lower Dimensional Processor Arrays," Sixth International
Parallel Processing Symposium, March 23-26, 1992, Beverly Hills, California.

Note  -  This  paper  improves  upon  the  method  proposed  in  [  ]  by  proposing  a 
conflict-avoidance  procedure  with  reduced  computational  complexity. 

[17]* Cam, H. and Fortes, J. A. B., "Fault-Tolerant Self-Routing Permutation
Networks," 1992 International Conference on Parallel Processing, August 17-21,
St. Charles, Illinois, Vol. I, pp. 243-247.

Note - This paper proposes a novel fault-tolerant interconnection network capable
of implementing any permutation using distributed destination-tag based
self-routing of messages. The time needed to implement any permutation is the
smallest of any network with comparable hardware complexity.

[18]* Saghi, G., Siegel, H. J., and Fortes, J. A. B., "On the Viability of a Quantitative
Model of System Reconfiguration Due to a Fault," 1992 International Conference
on Parallel Processing, August 17-21, St. Charles, Illinois, Vol. I, pp. 233-242.

Note - This paper proposes a quantitative model that can be used by an operating
system to decide what reconfiguration strategy should be used. It discusses
how the costs of three reconfiguration schemes may be quantifiable in a realistic
setting.

[19]* Cam, H. and Fortes, J. A. B., "Frames: A Simple Characterization of Permutations
Realized by Frequently Used Networks," Tech. Rpt. TR-EE-92-32, School
of Electrical Engineering, Purdue University, July 1992, 44 pages.

Note - This report proposes a simple characterization of permutation capabilities
of interconnection networks. The proposed approach makes it computationally
easy to detect whether a particular interconnection pattern (i.e.,
configuration) can be implemented by a network. Submitted for publication in
IEEE Transactions on Computers.

[20]* Cam, H. and Fortes, J. A. B., "New Routing Algorithms for Benes and Reduced
Networks," Submitted for publication in IEEE Transactions on Computers.

Note - This paper presents algorithms for the routing of messages in two
important rearrangeable networks. They can be used to set processor array systems
into configurations that exclude faulty components.

[21] Lopez-Benitez, N. and Fortes, J. A. B., "Detailed Modeling and Reliability
Analysis of Fault-Tolerant Processor Arrays," Tech. Rpt. TR-EE-89-31, School
of Electrical Engineering, Purdue University, June 1989, 94 pages.

Note - Long version of [8].

3.  Thesis  publications 

[22] Rau, D., (December 1988) "Destination-Tag Routing Schemes for Multistage
Interconnection Networks with Redundant Paths"

[23] Lopez-Benitez, N., (August 1989) "Detailed Modeling and Reliability Estimation
of Fault-Tolerant Processor Arrays"

[24] Chean, M., (August 1989) "Hardware Reconfiguration for Fault-Tolerant
Processor Arrays"

[25] Wang, Y-X., (May 1990) "Approximate Evaluation of Reliability, Mean-Time-
To-Failure and Optimal Redundancy of Fault-Tolerant Processor Arrays"

[26] Shang, W., (May 1990) "Scheduling, Partitioning and Mapping of Uniform
Dependence Algorithms on Processor Arrays"

[27] Cam, H., (May 1992) "Design and Permutation Routing Algorithms of
Rearrangeable Networks"


4.  Graduate  students  supported 

Darween  Rau  (male,  Asian  non-US) 

Mengly  Chean  (male,  Asian  US) 

Weijia  Shang  (female,  Asian  non-US) 

Noe  Lopez-Benitez  (male,  Hispanic  US) 

Mathew  O’Keefe  (male,  US) 

5.  Invited  Lectures  of  J.  Fortes 

[1] "Processor Arrays: Concept, Design, Applications and Real Machines," ID
Simposium Internacional Computacion UDEM'90, Universidad de Monterrey,
Monterrey, Mexico, September 28, 1990.

[2] "Systolic Array Design in the Linear Algebra Framework," Minisymposium on
Linear Algebra in Systolic Arrays, Second SIAM Conference on Linear Algebra,
San Francisco, November 5-8, 1990.

[3] "Mapping Algorithms onto Parallel Architectures: Time Schedules," Second
International IEE Specialist Seminar on "The Design and Application of Parallel
Digital Processors," Lisbon, Portugal, April 15-19, 1991.

[4]  "Generalized  Cycle  Shrinking,"  Conference  on  Algorithms  and  Parallel  VLSI 
Architectures  II,  Chateau  de  Bonas,  Gers,  France,  June  1991. 

[5]  "Exploiting  Parallelism  with  Linear  Schedules,"  International  Conference  for 
Young  Computer  Scientists,  Beijing,  China,  July  1991. 

[6] "Linear Schedules: Optimization and Relation to Parallelizing Compiler
Techniques," Center for Supercomputing Research and Development, University of
Illinois, February 4, 1992.

[7] "On the Adaptability of Algorithms and Architectures," CONPAR-VAPP 92
(1992 Conference on Parallel Processing - Vector and Array Parallel Processors),
Lyon, France, September 1-4, 1992.


6.  Honors  and  Editorial  Positions  of  J.  Fortes 


1989
Guest Editor (with S. Y. Kung) of 2 special issues of Journal of VLSI Signal
Processing on Systolic Systems.

1990-Present
Subject Area Editor for Journal of Parallel and Distributed Computing
Member of Editorial Board of Journal of VLSI Signal Processing

1992-Present
Member of Editorial Board of IEEE Transactions on Parallel and Distributed
Systems

J. Fortes - IEEE Computer Society Distinguished Visitor (1991, 1992)


7.  Reprints  of  Publications 


REFERENCE  NO.  1 


Shang,  W.  and  Fortes,  J.  A.  B.,  "On  the  Optimality  of  Linear  Schedules,"  Journal  of 
VLSI  Signal  Processing,  Volume  1,  Number  3,  November  1989,  pp.  209-220. 

Note - This paper shows how to compare the optimal linear schedule for a single
output computed by a recurrence-like algorithm. These schedules and algorithms capture
the majority of systolic array algorithms. It is also shown that optimal linear
schedules are very close to the best possible schedules and, in many cases, just as
good. These results are relevant to the problem of rescheduling systolic algorithms in
arrays with faulty components.


Journal of VLSI Signal Processing, 1, 209-220 (1989)
© 1989 Kluwer Academic Publishers, Boston. Manufactured in The Netherlands.


On  the  Optimality  of  Linear  Schedules 

WEIJIA SHANG AND JOSE A.B. FORTES

School of Electrical Engineering, Purdue University, West Lafayette, IN 47907
Received November 8, 1988

Abstract. An algorithm can be modeled as a set of indexed computations, and a schedule is a mapping of the algorithm
index space into time. Linear schedules are a special class of schedules that are described by a linear mapping
and are commonly used in many systolic algorithms. Free schedules cause computations of an algorithm to execute
as soon as their operands are available. If one computation uses data generated by another computation, then a
data dependence exists between these two computations which can be represented by the difference of their indices
(called dependence vector). Many important algorithms are characterized by the fact that data dependencies are
uniform, i.e., the values of the dependence vectors are independent of the indices of computations. There are
applications where it is of interest to find an optimal linear schedule with respect to the time of execution of a specific
computation of the given algorithm. This paper addresses the problem of identifying optimal linear schedules for
uniform dependence algorithms so that the execution time of a specific computation of the algorithm is minimized
and proposes a procedure to solve this problem based on the mathematical solution of a linear optimization problem.
Also, linear schedules are compared with free schedules. The comparison indicates that optimal linear schedules
can be as efficient as free schedules, the best schedules possible, and identifies a class of algorithms for which
this is always true.


1.  Introduction 

The algorithms under consideration in this paper are
characterized by uniform data dependencies and unit-time
computations; they include those described by
single uniform recurrences [1] and resemble a very large
number of systolic computations and algorithms described
by programs with nested loops. Informally, an
algorithm is represented in this paper as a partially
ordered subset of a multidimensional integer lattice
(called index set). The points of this lattice correspond
to (i.e., are the indices of) computations, and the partial
order reflects the data dependencies between them.
These data dependencies are represented as vectors that
connect points of the lattice. If a given dependence vector
is always present when the vector difference between
any two lattice points equals the dependence vector,
then the dependence is said to be uniform. If all
dependences are uniform then the algorithm is said to
be a uniform dependence algorithm.

This research was supported in part by the National Science Foundation
under Grant DCI-8419743 and in part by the Innovative Science
and Technology Office of the Strategic Defense Initiative Organization
and was administered through the Office of Naval Research under
contracts No. 00014-85-k-0588 and No. 00014-88-k-0723.


A linear schedule is a mapping from the multidimensional
algorithm index set into the one-dimensional
time space; this mapping is expressed as a linear
transformation that involves the multiplication of a vector,
called linear schedule vector, by each and every point
of the index set. The image of the index point under
the mapping is the time of execution of the computation
indexed by that point. This algorithm model and the
notion of linear schedule are easily related to similar
models and concepts used in [1]-[13] and several other
works.
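
As a small illustration of the definition above, the following Python sketch applies a linear schedule vector to every point of an index set and prints the resulting execution times. The rectangular index set, the vector `pi` and the helper name `linear_schedule` are assumptions chosen for illustration; they are not taken from the paper.

```python
from itertools import product


def linear_schedule(pi, index_set):
    """Map each index point j to the time pi . j (a dot product)."""
    return {j: sum(p * x for p, x in zip(pi, j)) for j in index_set}


# Hypothetical 2-D index set: all integer points with 0 <= j1, j2 <= 2.
index_set = list(product(range(3), repeat=2))
pi = (1, 1)  # hypothetical linear schedule vector

times = linear_schedule(pi, index_set)
for j in sorted(index_set, key=times.get):
    print(j, "executes at time", times[j])
```

Points with the same value of `pi . j` lie on a common line (a hyperplane in higher dimensions) and execute at the same time step.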

There are two formulations of the optimal linear schedule
problem. In one case, the execution time of the set
of all computations of the algorithm is to be minimized.
In [14], a procedure is provided to find an optimal solution
for this problem for any uniform dependence algorithm
whose index set is a convex polyhedron. In some
applications, given an algorithm, it is of interest to find
an optimal linear schedule with respect to the execution
time of one specific computation. Clearly, to execute this
specific computation, all computations on which that
specific computation depends must execute first. This
gives the second formulation of the optimal linear schedule
problem, where the goal is to minimize the execution
time of a specific computation of the algorithm.




A free schedule schedules computations to execute
as soon as their operands are available. The total execution
time that results from using a free schedule is
an exact lower bound for the execution time of the
algorithm. In [1] and [12] the execution times achievable
by linear schedules and free schedules are compared.
The difference between the execution time achieved by
the free schedule and the execution time achievable by
an optimal linear schedule is bounded by a constant
[1]. In [12], it was found that the difference is equal
to either one or zero for a set of 25 algorithms.
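
To make the notion of a free schedule concrete, the sketch below computes one for a tiny example: each computation runs one step after the latest of its operands, or at step 0 if it has no operand inside the index set. The index set, the two dependence vectors and the linear schedule vector (1, 1) are hypothetical values picked for illustration, and the recursion is only one straightforward way to obtain the times.

```python
from functools import lru_cache
from itertools import product

# Hypothetical example data (not from the paper).
INDEX_SET = frozenset(product(range(3), repeat=2))
DEPENDENCES = ((1, 0), (0, 1))  # columns of a hypothetical dependence matrix


@lru_cache(maxsize=None)
def free_time(j):
    """Earliest step at which computation j can run (unit-time computations)."""
    # The operands of j are the values computed at j - d for each dependence d.
    preds = [tuple(a - b for a, b in zip(j, d)) for d in DEPENDENCES]
    preds = [p for p in preds if p in INDEX_SET]
    if not preds:
        return 0
    return 1 + max(free_time(p) for p in preds)


free = {j: free_time(j) for j in INDEX_SET}
linear = {j: j[0] + j[1] for j in INDEX_SET}  # linear schedule with vector (1, 1)

print("free schedule finishes at step", max(free.values()))
print("linear schedule finishes at step", max(linear.values()))
```

For this particular toy example the two schedules coincide at every index point; that is a property of the chosen data, not a general statement.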

In this paper, the second formulation of the linear
schedule problem is considered and a procedure is proposed
to find the linear schedule that minimizes the execution
time of a specific computation of the algorithm.
Also, the results reported in [1] and [12] on the comparison
of linear schedules and free schedules are extended.
A class of algorithms is identified for which the difference
between the execution times achieved by free
schedules and linear schedules, respectively, is always
zero. This means that linear schedules can achieve the
execution time lower bound for this class of algorithms.

This paper is organized as follows. The basic terminology
and definitions used throughout the paper are
introduced in Section 2 as well as the formulation of
the problem of determining an optimal linear schedule
when the execution time of a single specific computation
is to be minimized. Section 3 is dedicated to the
solution of this problem. Linear schedules are compared
to free schedules in Section 4, and sufficient conditions
are provided for an algorithm to have an optimal
linear schedule which has the same execution time as
the free schedule. Section 5 is dedicated to conclusions.

2.  Terminology  and  Definitions 

Throughout this paper, sets, matrices and row vectors
are denoted by capital letters, column vectors are represented
by lower case symbols with an overbar, and
scalars correspond to lower case letters. The transposes
of a vector $\bar{v}$ and a matrix $M$ are denoted $\bar{v}^T$ and $M^T$,
respectively. The symbol $E_i$ denotes the row vector
whose entries are all zeroes except that the $i$th entry
is equal to unity. The vector $1$ (or $0$) denotes the row
vector or column vector whose entries are all ones (or
zeroes). The dimensions of vectors $1$ and $0$ and whether
they denote row or column vectors are implied by the
context in which they are used. The symbol $I$ denotes
the identity matrix. The rank of a matrix $A$ is denoted
$\mathrm{rank}(A)$. The set of rational numbers, the real space
and the set of integers are denoted $\mathbb{Q}$, $\mathbb{R}$ and $\mathbb{Z}$, respectively.
The set of non-negative integers and the set of
positive integers are denoted $\mathbb{N}$ and $\mathbb{N}^+$, respectively.
The empty set is denoted $\emptyset$. The symbol $\lfloor a \rfloor$ denotes
the greatest integer that is less than or equal to $a$. The
greatest common divisor of integers $a_1, \ldots, a_n$ is
denoted $\gcd(a_1, \ldots, a_n)$. As a final remark, if $x$ is an
element of a set $S$, the notation $x \in S$ is used and this
notation is "abused" to indicate also that a column vector
$\bar{m}_j$ (or row vector $M_i$) is a column (row) of a matrix
$M$, i.e., $\bar{m}_j \in M$ ($M_i \in M$) means $\bar{m}_j$ ($M_i$) is a column (row)
vector of matrix $M$.

The  algorithms  of  interest  in  this  paper  are  the  so- 
called  uniform  dependence  algorithms  defined  as 
follows. 

Definition 2.1. (Uniform dependence algorithm): A
uniform dependence algorithm is an algorithm that can
be described by an equation of the form

$$v(\bar{j}) = g_{\bar{j}}\big(v(\bar{j} - \bar{d}_1),\ v(\bar{j} - \bar{d}_2),\ \ldots,\ v(\bar{j} - \bar{d}_m)\big) \qquad (2.1)$$

where

1. $\bar{j} \in J \subset \mathbb{Z}^n$ is an index point (a column vector), $J$ is
the index set of the algorithm, and $n \in \mathbb{N}^+$ is the
number of components of $\bar{j}$;

2. $g_{\bar{j}}$ is the computation indexed by $\bar{j}$, i.e., a single-valued
function computed "at point $\bar{j}$" in a single
unit of time;

3. $v(\bar{j})$ is the value computed "at $\bar{j}$," i.e., the result of
computing the right hand side of (2.1); and

4. $\bar{d}_i \in \mathbb{Z}^n$, $i = 1, \ldots, m$, $m \in \mathbb{N}$, are dependence vectors,
also called dependencies, which are constant (i.e.,
independent of $\bar{j} \in J$); the matrix $D = [\bar{d}_1, \ldots, \bar{d}_m]$
is called the dependence matrix.

The class of uniform dependence algorithms is a
simple extension of the class of computations described
by uniform recurrence equations [1]. The main difference
is that uniform dependence algorithms allow for
different functions to be computed (in a unit of time)
at different points of the index set. From a practical
viewpoint, uniform dependence algorithms can be easily
related to programs where (1) a single statement appears
in the body of a multiply nested loop and (2) the indices
of the variable in the left-hand side of the statement differ
by a constant from the corresponding indices in each
reference to the same variable in the right-hand side.
Alternative computations can occur in each iteration
as a result of a single conditional statement as long as
data dependencies do not change. Nested loop programs
with multiple statements can also use the techniques
of this paper together with the alignment method discussed
in [15] and [16].
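
The connection to nested loops can be sketched as follows. The loop body, its boundary values and the helper `dependence_vectors` are hypothetical and serve only to show how constant index offsets between the left-hand side and the right-hand side references yield constant dependence vectors.

```python
def dependence_vectors(rhs_offsets):
    """Turn right-hand-side index offsets into dependence vectors.

    An offset is (index read on the right) - (index written on the left),
    so the dependence vector is the negated offset.
    """
    return [tuple(-o for o in off) for off in rhs_offsets]


def run_loop(n):
    """A doubly nested loop with a single statement in its body."""
    x = [[1] * (n + 1) for _ in range(n + 1)]  # hypothetical boundary values
    for j1 in range(1, n + 1):
        for j2 in range(1, n + 1):
            # Reads x[j1-1][j2] and x[j1-1][j2-1]: offsets (-1, 0) and (-1, -1).
            x[j1][j2] = x[j1 - 1][j2] + x[j1 - 1][j2 - 1]
    return x


D = dependence_vectors([(-1, 0), (-1, -1)])
print("dependence vectors:", D)
print("x[2][2] =", run_loop(2)[2][2])
```

Because the offsets do not depend on `j1` or `j2`, the same two dependence vectors apply at every iteration, which is the uniformity property described above.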

For the purpose of this paper, only structural information
of the algorithm, i.e., the index set $J$ and the
dependence matrix $D$, is needed. Other information,
such as what computations occur at different points and
where and when input/output of variables takes place,
can be ignored. Therefore, a uniform dependence algorithm
with index set $J$ and dependence matrix $D$ is from here
on characterized simply by the pair $(J, D)$. It is also
assumed that, as in Definition 2.1, the letters $n$ and $m$
always denote the dimension of index points in $J$ and
the number of dependence vectors, respectively. The
following example illustrates the concept of uniform
dependence algorithms.

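Example 2.1. Consider the following uniform dependence
algorithm:

$$x(j_1, j_2) = g\big(x(j_1 - 1,\ j_2 + 3),\ x(j_1 - 2,\ j_2 - 4),\ x(j_1 - 2,\ j_2)\big)$$

where $j_1 = 0, \ldots, 2s$ and

$$j_2 = \begin{cases} 0, \ldots, [\text{illegible}] - j_1 & \text{if } s \le j_1 \le 2s \\ s - j_1, \ldots, 2s - j_1 & \text{if } 0 \le j_1 \le s \end{cases}$$

The index set $J$ of this algorithm is shown in Figure
1. The reader can verify easily that it can be described
as $J = \{\bar{j} = [j_1, j_2]^T : A\bar{j} \le \bar{b},\ \bar{j} \in \mathbb{Z}^2\}$ where
[the entries of $A$ and $\bar{b}$ are illegible in the source].

The dependence matrix is $D = [\bar{d}_1, \bar{d}_2, \bar{d}_3]$ where
$\bar{d}_1 = [1, -3]^T$, $\bar{d}_2 = [2, 4]^T$ and $\bar{d}_3 = [2, 0]^T$.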

The dependence vectors $d_i$, $i = 1, \ldots, m$, induce
a partial ordering on the index set J. This ordering,
denoted $\propto$, can be easily described in terms of the
algorithm precedence graph. Informally, the set of nodes
of the precedence graph is the index set, and if the
computation indexed by $j_2$ depends directly on the computation
indexed by $j_1$, then there is a directed edge from node
$j_1$ to node $j_2$. The formal definition is given next.

Definition 2.2. (Algorithm precedence graph and
directed path): The precedence graph of algorithm
(J, D) is the directed graph (J, E) where J is the set
of nodes of the graph and $E = \{(j, j') : j' - j = d_i,\ d_i \in D,\ j, j' \in J\}$
is the set of edges. If $(j, j') \in E$, then j and j' are called
the tail and the head of edge (j, j'), respectively. If there
exists a set of index points $j_1, \ldots, j_l$ such that
$(j, j_1), (j_1, j_2), \ldots, (j_l, j^*) \in E$, then j is connected to $j^*$.
The tuple $L = (j, j_1, \ldots, j_l, j^*)$ is called a directed path
and its length is the number of edges on the path, i.e., $l + 1$.

Fig. 1. The index set of the algorithm of Example 2.1.

The partial ordering $\propto$ induced by the dependence
vectors on the index set J is such that, if $j, j' \in J$, then
$j \propto j'$ if and only if there exists a directed path from j
to j' in the precedence graph of the algorithm.
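As a minimal sketch of Definition 2.2, the following Python fragment builds the edge set E for a small index set. The helper name `precedence_edges` is an assumption introduced for illustration; the index points are those listed as H(p) in Example 3.2, and the dependence vectors are those of Example 2.1.

```python
def precedence_edges(J, D):
    """E = {(j, j') : j' - j = d for some d in D, with j and j' both in J}."""
    points = set(J)
    edges = set()
    for tail in points:
        for d in D:
            head = (tail[0] + d[0], tail[1] + d[1])
            if head in points:
                edges.add((tail, head))  # the computation at head depends on tail
    return edges


# Dependence vectors d1, d2, d3 of Example 2.1.
D = [(1, -3), (2, 4), (2, 0)]
# Index points taken from the H(p) listing of Example 3.2.
H_p = [(11, 0), (9, 0), (7, 0), (8, 3), (6, 3), (4, 3),
       (5, 6), (3, 6), (1, 6), (2, 9), (0, 9)]

E = precedence_edges(H_p, D)
# Points with no incoming edge have no predecessor inside this set.
initial = [j for j in H_p if not any(head == j for _, head in E)]
print(len(E), initial)
```

In this small set only d1 and d3 generate edges, which agrees with the observation in Example 3.2 that d2 is not present in H(p).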

Definition 2.3. (Schedules): A schedule for algorithm
(J, D) is a function $\sigma: J \to Z$ which is strictly monotone
increasing with respect to the ordering $\propto$ induced by
$d_i$, $i = 1, \ldots, m$, in J, i.e., if $j \propto j'$ (the computation
indexed by j' depends on the computation indexed by j), then
$\sigma(j) < \sigma(j')$.

In other words, a schedule is a mapping which assigns
a time of execution to each computation of the algorithm
in such a way that dependencies are preserved, i.e.,
if the computation indexed by j' depends on the computation
indexed by j, then the computation indexed by j'
can be executed only after the execution of the computation
indexed by j. Examples of schedules of interest in this
paper include free schedules and linear schedules, both
of which are defined next.




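Definition 2.4. (Free schedule): For algorithm (J, D)
the free schedule is a mapping $\sigma_f: J \to Z$ such that

$$\sigma_f(j) = \begin{cases} 0 & \text{if } \nexists\, d_i \in D:\ j - d_i \in J \\ \max\{\sigma_f(j - d_i) : j - d_i \in J,\ i = 1, \ldots, m\} + 1 & \text{if } \exists\, d_i \in D:\ j - d_i \in J. \end{cases}$$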
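The recursive form of Definition 2.4 can be sketched directly in Python. This is only an illustration under stated assumptions: the function name `free_schedule` is hypothetical, index points are 2-tuples, and the precedence graph is assumed to be acyclic so that the recursion terminates. The data are the points listed as H(p) in Example 3.2 with the two dependence vectors present there.

```python
from functools import lru_cache


def free_schedule(J, D):
    """Return {j: sigma_f(j)} following the recursive form of Definition 2.4."""
    points = frozenset(J)

    @lru_cache(maxsize=None)
    def sigma_f(j):
        # Times of all direct predecessors j - d that lie inside the index set.
        times = [sigma_f((j[0] - d[0], j[1] - d[1]))
                 for d in D
                 if (j[0] - d[0], j[1] - d[1]) in points]
        return max(times) + 1 if times else 0

    return {j: sigma_f(j) for j in points}


D_p = [(1, -3), (2, 0)]
H_p = [(11, 0), (9, 0), (7, 0), (8, 3), (6, 3), (4, 3),
       (5, 6), (3, 6), (1, 6), (2, 9), (0, 9)]

sigma = free_schedule(H_p, D_p)
total_time = max(sigma.values()) + 1   # time steps 0 .. max are all used
print(sigma[(11, 0)], total_time)
```

The total of 8 time steps agrees with the free-schedule execution time quoted for (H(p), D_p) in Example 3.2.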

Definition 2.5. For algorithm (J, D) a schedule $\sigma_\Pi: J \to Z$
is a linear schedule if $\sigma_\Pi(j) = \lfloor \Pi j + c \rfloor$, $j \in J$,
where $\Pi \in Q^{1 \times n}$ is such that $\min\{\Pi d_i : d_i \in D\} = 1$ and
$c = -\min\{\Pi j : j \in J\}$.

Definitions 2.3 and 2.4 are similar to equivalent definitions
in [1]. Free schedules are the fastest schedules
possible, and the total execution time achieved by free
schedules is the exact lower bound of the execution time
of any algorithm. It is relatively simple to verify that
both free schedules and linear schedules satisfy Definition
2.3, i.e., they preserve data dependencies. In the
last definition, the entries of $\Pi$ can be any rational number.
However, given such a $\Pi$, it is always possible to
find $\Pi' \in Z^{1 \times n}$ and a constant $\mathrm{disp}\,\Pi' \in N^+$ such that
$\Pi = \Pi'/\mathrm{disp}\,\Pi'$ where $\mathrm{disp}\,\Pi'$ equals the least common
multiple of the denominators of the entries of $\Pi$. For
this reason, the following definition of a linear schedule
is equivalent to that of Definition 2.5 and is used
throughout this paper.

Definition 2.6. (Linear schedule and linear schedule
vector): For algorithm (J, D) a linear schedule is a
function $\sigma_\Pi: J \to N$ such that

$$\sigma_\Pi(j) = \left\lfloor \frac{\Pi j + c}{\mathrm{disp}\,\Pi} \right\rfloor, \quad j \in J \qquad (2.2)$$

where $\Pi \in Z^{1 \times n}$, $\mathrm{disp}\,\Pi = \min\{\Pi d_i : d_i \in D\} > 0$,
$\gcd(\pi_1, \ldots, \pi_n) = 1$ and $c = -\min\{\Pi j : j \in J\}$. The
row vector $\Pi$ is called the linear schedule vector associated
with $\sigma_\Pi$.
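Equation 2.2 translates almost literally into code. The sketch below is illustrative only; `linear_schedule` and `dot` are hypothetical helper names, and the data are the vector [3, -1] and the point set H(p) that appear in Example 3.2.

```python
def dot(u, v):
    """Inner product of two equally long integer tuples."""
    return sum(a * b for a, b in zip(u, v))


def linear_schedule(Pi, J, D):
    """Return {j: floor((Pi.j + c) / disp Pi)} as in Equation 2.2."""
    disp = min(dot(Pi, d) for d in D)
    if disp <= 0:
        raise ValueError("Pi must satisfy Pi.d > 0 for every dependence vector")
    c = -min(dot(Pi, j) for j in J)
    # Integer floor division implements the floor function exactly.
    return {j: (dot(Pi, j) + c) // disp for j in J}


D_p = [(1, -3), (2, 0)]
H_p = [(11, 0), (9, 0), (7, 0), (8, 3), (6, 3), (4, 3),
       (5, 6), (3, 6), (1, 6), (2, 9), (0, 9)]

sigma = linear_schedule((3, -1), H_p, D_p)
total_time = max(sigma.values()) - min(sigma.values()) + 1
print(sigma[(11, 0)], sigma[(0, 9)], total_time)
```

Here both dependence vectors have displacement 6, so every edge of the precedence graph advances the schedule by exactly one time step.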

In some applications, given an algorithm (J, D), it
is of interest to find an optimal linear schedule with
respect to the time of execution of one specific computation.
Let p be a point in J. If there exists a directed
path from point j to p in the precedence graph of the
algorithm (J, D), j is a predecessor of p. Clearly, to
execute the computation indexed by p all the computations
indexed by the predecessors of p have to be executed
first. Let $j_1, \ldots, j_k$ be all the predecessors of p,
H(p) be the set of $j_1, \ldots, j_k$ and p, and $D_p$ be the
matrix of dependence vectors that are present in H(p),
i.e., $d_i \in D_p$ if and only if there exist two points $j_1$,
$j_2 \in H(p)$ such that $j_1 = j_2 + d_i$. Suppose that there are
$m_p$ dependence vectors in $D_p$. Clearly, $D_p$ is a submatrix
of D and H(p) is a subset of J, i.e., $(H(p), D_p)$
is a sub-algorithm of (J, D). Due to the fact that the
computation indexed by p is the last to execute, the total
execution time $t_p$ for $(H(p), D_p)$ with a linear schedule
$\sigma_\Pi$ can be expressed as

$$t_p = \left\lfloor \frac{\max\{\Pi(p - j) : j \in H(p)\}}{\min\{\Pi d_i : d_i \in D_p\}} \right\rfloor + 1. \qquad (2.3)$$

Because $t_p$ is minimum if the argument of the floor
function in Equation 2.3 is minimized, the problem of
finding an optimal linear schedule with respect to the
execution of one specific computation is formulated as
follows:

Problem 2.1. (Time optimal linear schedule problem
for execution of one computation): For algorithm
(J, D), let $p \in J$. The time optimal linear schedule problem
for execution of the computation indexed by p consists
of finding a linear schedule vector $\Pi_p \in Q^{1 \times n}$ such
that it minimizes

$$f_p = \frac{\max\{\Pi(p - j) : j \in H(p)\}}{\min\{\Pi d_i : d_i \in D_p\}} \qquad (2.4)$$

subject to $\Pi d_i > 0$, $d_i \in D_p$.

Notice that if $\Pi_p$ is an optimal solution of Problem
2.1, so is $a\Pi_p$ for any non-zero constant a. This guarantees
that the optimal solution $\Pi_p$ can always be obtained
such that $\Pi_p \in Z^{1 \times n}$ and the greatest common
divisor of the components of $\Pi_p$ is equal to one. After
$\Pi_p$ is found, then, according to Definition 2.6, the constant
c can be determined and the corresponding optimal
linear schedule $\sigma_{\Pi_p}$ be specified completely.

The solution of Problem 2.1 is discussed in the next
section. From here on, the letters $m_p$ and $m'_p$ always
denote the number of dependence vectors in matrix $D_p$
and the rank of matrix $D_p$, respectively; and without
loss of generality, it is assumed that $D_p$ contains the
first $m_p$ columns of D, i.e., $D_p = [d_1, \ldots, d_{m_p}]$.

3.  Solution  of  Time  Optimal  Linear  Schedule 
Problem  for  Execution  of  One  Computation 

This section discusses the solution of Problem 2.1, i.e.,
the problem of finding an optimal linear schedule such
that the execution time of the computation indexed by
a point $p \in J$ is minimized. A theorem and a procedure
are provided that describe how to find the solution of
Problem 2.1. The long proof of the theorem is provided
in the Appendix. However, the idea of the proof is
described informally and examples are given to illustrate
the main concepts.

Let $D(c_1 \ldots c_x / r_1 \ldots r_y)$ denote the submatrix of D
containing the elements in columns $c_1, \ldots, c_x$ and
rows $r_1, \ldots, r_y$, i.e., it contains the elements of D at
the intersections of columns $c_1, \ldots, c_x$ and rows $r_1,
\ldots, r_y$, respectively. If $D(c_1 \ldots c_k / r_1 \ldots r_k)$ is nonsingular,
an integer row vector $V = [v_1, \ldots, v_k] \in Z^{1 \times k}$
is defined as $V = \beta \bar{1} D^{-1}(c_1 \ldots c_k / r_1 \ldots r_k)$ where $\beta$ is a
positive integer constant such that $\gcd(v_1, \ldots, v_k) = 1$.
In other words, V is a vector whose entries are the sums
of the corresponding columns of $D^{-1}(c_1 \ldots c_k / r_1 \ldots r_k)$
scaled so that they are integers with the greatest common
divisor equal to unity. If $D(c_1 \ldots c_k / r_1 \ldots r_k)$
is nonsingular, then $\Pi(c_1 \ldots c_k / r_1 \ldots r_k) = VB \in Z^{1 \times n}$,
where

$$B = \begin{bmatrix} E_{r_1} \\ \vdots \\ E_{r_k} \end{bmatrix}$$

and $E_i$ is the row vector with n components whose entries
are zero except that the ith entry is 1.
In words, the subvector $[\pi_{r_1}, \ldots, \pi_{r_k}]$ of
$\Pi(c_1 \ldots c_k / r_1 \ldots r_k)$ is the same as V and the remaining
entries of $\Pi(c_1 \ldots c_k / r_1 \ldots r_k)$ are zero. Finally, $C^p$ denotes the
set $\{\Pi(c_1 \ldots c_{m'_p} / r_1 \ldots r_{m'_p}) : 1 \le c_1 < \cdots < c_{m'_p} \le
m_p,\ 1 \le r_1 < \cdots < r_{m'_p} \le n\}$. The following example
illustrates the notation and concepts just introduced.


Example 3.1. Consider the algorithm (J, D) of Example
2.1 where index set J is shown in Figure 1 and

$$D = \begin{bmatrix} 1 & 2 & 2 \\ -3 & 4 & 0 \end{bmatrix}.$$

According to the definitions just introduced, D(1/1) = [1]
(contains the element at the intersection of column 1
and row 1 of D), its corresponding $V = \beta \bar{1} D^{-1}(1/1) = [1]$
where $\beta = 1$, $B = E_1 = [1, 0]$ and $\Pi(1/1) = VB = [1, 0]$;
similarly, D(1/2) = [-3], its corresponding
$V = \beta \bar{1} D^{-1}(1/2) = [-1]$ where $\beta = 3$, $B = E_2 = [0, 1]$ and
$\Pi(1/2) = VB = [0, -1]$; and

$$D(12/12) = \begin{bmatrix} 1 & 2 \\ -3 & 4 \end{bmatrix},$$

its corresponding $V = \beta \bar{1} D^{-1} = [7, -1]$ where $\beta = 10$,
$B = I$ and $\Pi(12/12) = VB = V = [7, -1]$.
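The construction of a vector Π(c.../r...) can be sketched with exact rational arithmetic. The code below is an illustration, not the authors' implementation: `solve_ones` and `schedule_vector` are hypothetical names, column and row indices are 1-based as in the text, and the scaling step multiplies by the least common multiple of the denominators and then divides out any common factor.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def solve_ones(M):
    """Row vector x with x.M = [1, ..., 1] for a square M, or None if M is singular."""
    k = len(M)
    # x.M = 1 is the same as M^T x^T = 1^T; build the augmented matrix of M^T.
    A = [[Fraction(M[r][c]) for r in range(k)] + [Fraction(1)] for c in range(k)]
    for col in range(k):
        piv = next((r for r in range(col, k) if A[r][col] != 0), None)
        if piv is None:
            return None
        A[col], A[piv] = A[piv], A[col]
        A[col] = [a / A[col][col] for a in A[col]]
        for r in range(k):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][k] for r in range(k)]


def schedule_vector(D, cols, rows):
    """Pi(cols/rows) as an integer list of length n with gcd 1, or None if singular."""
    n = len(D)
    sub = [[D[r - 1][c - 1] for c in cols] for r in rows]
    x = solve_ones(sub)                      # x = 1 . sub^{-1}
    if x is None:
        return None
    scale = reduce(lambda a, b: a * b // gcd(a, b), (v.denominator for v in x), 1)
    ints = [int(v * scale) for v in x]
    g = reduce(gcd, ints, 0)
    Pi = [0] * n
    for v, r in zip(ints, rows):
        Pi[r - 1] = v // g                   # place V into the selected row positions
    return Pi


D = [[1, 2, 2], [-3, 4, 0]]
print(schedule_vector(D, [1], [1]))
print(schedule_vector(D, [1], [2]))
print(schedule_vector(D, [1, 2], [1, 2]))
```

Exact fractions are used instead of floating point so that the integer scaling and the gcd normalization are not affected by rounding.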

Without loss of generality let $D_p = [d_1, \ldots, d_{m_p}]$ and
$\mathrm{rank}(D_p) = m'_p \le m_p$. The next theorem states that
$C^p$ is the candidate set for $(H(p), D_p)$, i.e., $\Pi_p \in C^p$.


Theorem 3.1. Consider algorithm (J, D) and let $p \in J$.
An optimal solution $\Pi_p$ of Problem 2.1 for $(H(p), D_p)$
belongs to $C^p$, i.e.,

$$\Pi_p \in C^p = \{\Pi(c_1 \ldots c_{m'_p} / r_1 \ldots r_{m'_p}) : 1 \le c_1 < \cdots < c_{m'_p} \le m_p,\ 1 \le r_1 < \cdots < r_{m'_p} \le n\}. \qquad (3.1)$$

Proof.  Provided  in  Appendix. 

The idea of the proof is as follows. The fact that the
computation indexed by p is the last to execute is explored
and Problem 2.1 is reformulated as a linear programming
problem. Its dual is considered. For $m'_p = n$,
it is shown that $G = [d_{c_1}, \ldots, d_{c_n}]$, where $d_{c_1}, \ldots,
d_{c_n} \in D_p$, is an optimal basis for the dual problem and
$\bar{1}G^{-1} = \Pi_p$ is an optimal solution of Problem 2.1. So
$\Pi_p = \Pi(c_1 \ldots c_n / 1 \ldots n)$ and it belongs to $C^p$. If $m'_p <
n$, H(p) is transformed into an $m'_p$-dimensional space
by a bijective linear mapping $\tau$ specified by a matrix
$T \in Z^{m'_p \times n}$. By applying the same reasoning as in the case
where $m'_p = n$, it is shown that $\Pi_p = \Pi(c_1 \ldots c_{m'_p} /
r_1 \ldots r_{m'_p})$ for some values of $r_1, \ldots, r_{m'_p}$. The candidate
set $C^p$ is constructed by the following procedure.


Procedure 3.1. (Construction of $C^p$):

Input: Algorithm $(H(p), D_p)$.

Output: A finite candidate set $C^p$ containing the optimal
solution of Problem 2.1.

Step 1: $C^p = \emptyset$ and $l = 1$.

Step 2: Pick up a set of $m'_p$ linearly independent columns
from $D_p$.

Step 3: Let the set of $m'_p$ linearly independent columns
being processed be $d_{c_1}, \ldots, d_{c_{m'_p}}$. Find
a matrix

$$B = \begin{bmatrix} E_{r_1} \\ \vdots \\ E_{r_{m'_p}} \end{bmatrix}, \quad 1 \le r_1 < \cdots < r_{m'_p} \le n,$$

such that $D(c_1 \ldots c_{m'_p} / r_1 \ldots r_{m'_p}) = B[d_{c_1},
\ldots, d_{c_{m'_p}}]$ is not singular and set $\Pi_l =
\Pi(c_1 \ldots c_{m'_p} / r_1 \ldots r_{m'_p})$, $C^p = C^p \cup \{\Pi_l\}$,
$l = l + 1$.

Step 4: Check if all distinct sets of $m'_p$ linearly independent
columns from $D_p$ have been processed.
If not, pick up an unprocessed set and
go to Step 3. Otherwise, stop.
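The steps above can be sketched as a brute-force enumeration. The sketch departs from the procedure in two labeled ways: instead of computing the rank first, it tries the largest square submatrices and steps down until a nonsingular one is found (the size of the largest nonsingular square submatrix equals the rank), and it keeps every nonsingular row selection rather than a single matrix B per column set, which matches the set in Equation 3.1. The names `solve_ones` and `candidate_set` are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from itertools import combinations
from math import gcd


def solve_ones(M):
    """Row vector x with x.M = [1, ..., 1] for a square M, or None if M is singular."""
    k = len(M)
    A = [[Fraction(M[r][c]) for r in range(k)] + [Fraction(1)] for c in range(k)]
    for col in range(k):
        piv = next((r for r in range(col, k) if A[r][col] != 0), None)
        if piv is None:
            return None
        A[col], A[piv] = A[piv], A[col]
        A[col] = [a / A[col][col] for a in A[col]]
        for r in range(k):
            if r != col and A[r][col] != 0:
                f = A[r][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[r][k] for r in range(k)]


def candidate_set(Dp):
    """Set of integer vectors Pi(c.../r...) over all nonsingular rank-sized submatrices."""
    n, m = len(Dp), len(Dp[0])
    for k in range(min(n, m), 0, -1):        # largest k with a nonsingular k x k block
        found = set()
        for cols in combinations(range(m), k):
            for rows in combinations(range(n), k):
                x = solve_ones([[Dp[r][c] for c in cols] for r in rows])
                if x is None:
                    continue
                scale = reduce(lambda a, b: a * b // gcd(a, b),
                               (v.denominator for v in x), 1)
                ints = [int(v * scale) for v in x]
                g = reduce(gcd, ints, 0)
                Pi = [0] * n
                for v, r in zip(ints, rows):
                    Pi[r] = v // g
                found.add(tuple(Pi))
        if found:
            return found
    return set()


print(candidate_set([[1, 2], [-3, 0]]))          # D_p of Example 3.2
print(candidate_set([[1, 2, 2], [-3, 4, 0]]))    # full D of Example 2.1
```

The number of submatrices examined depends only on the shape of the dependence matrix, not on the number of index points, which is the property the complexity discussion relies on.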

Clearly, Step 3 dominates the whole procedure. For
Step 3, each iteration needs at most [...] operations
to find matrix B such that $B[d_{c_1}, \ldots, d_{c_{m'_p}}]$ is
non-singular and there are at most $\binom{m_p}{m'_p}$ iterations. So
the complexity of Procedure 3.1 is bounded above by
O([...]) and is independent of the size of
the algorithm (i.e., does not depend on the cardinality
of H(p)). If $m'_p = m_p = n$, then the complexity is
$O(n^3)$. The following example shows how to construct
$C^p$ according to Procedure 3.1.


Example 3.2. Consider the algorithm of Example 2.1
where

$$D = \begin{bmatrix} 1 & 2 & 2 \\ -3 & 4 & 0 \end{bmatrix}.$$

Figure 2 shows the index set J when s = 6. Let $p =
[11, 0]^T$. Then $H(p) = \{[11, 0]^T, [9, 0]^T, [7, 0]^T, [8, 3]^T,
[6, 3]^T, [4, 3]^T, [5, 6]^T, [3, 6]^T, [1, 6]^T, [2, 9]^T, [0, 9]^T\}$.
The set H(p) contains the points marked in Figure
2. Only dependence vectors $[1, -3]^T$ and $[2, 0]^T$ are
present in H(p), i.e.,

$$D_p = \begin{bmatrix} 1 & 2 \\ -3 & 0 \end{bmatrix}.$$

There is only one combination of two dependence vectors
from $D_p$, which determines a vector $\Pi(13/12) =
[3, -1]$. Therefore, $C^p = \{[3, -1]\}$ and $\Pi_p = [3, -1]$.
According to [14], the optimal linear schedule for (J,
D) is $\Pi(12/12)$, which is different from $\Pi_p$. This implies
that, in general, the optimal linear schedule for (J, D)
is not necessarily the same as the one for $(H(p), D_p)$
where $p \in J$. The total execution time for $(H(p), D_p)$ by
$\Pi_p$ is $t(\Pi_p) = \sigma_{\Pi_p}(p) - \sigma_{\Pi_p}([0, 9]^T) + 1 = 8$. Notice
that the total execution time by the free schedule for
$(H(p), D_p)$ is also 8 units, i.e., the total execution time
by the optimal linear schedule $\sigma_{\Pi_p}$ is equal to the total
execution time by the free schedule. So, the optimal
linear schedule achieves the lower bound of the execution
time of $(H(p), D_p)$ in this example.

4.  Comparison  of  Execution  Times  by  Optimal 
Linear  Schedules  and  Free  Schedules 

This section compares the execution time achievable
by optimal linear schedules with that of free schedules.
First, a sufficient condition on the dependence matrix
D of an algorithm is provided which guarantees that
both schedules achieve the same execution times for
a specific computation of the algorithm. Afterwards,
the execution time of all computations of the algorithm
is briefly considered. It is shown by examples that, for
the last case, both the size and shape of the index set
determine how linear schedules compare with free
schedules.

Fig. 2. The index set of the algorithm of Example 3.1 when s = 6.
H(p) is the set of points which are marked and connected
to p by dependence vectors $[1, -3]^T$ and $[2, 0]^T$.

Let $t(\Pi, J, D)$, $t(\sigma_f, J, D)$ denote the total execution
time for algorithm (J, D) achieved by the linear schedule
$\sigma_\Pi$ and by the free schedule $\sigma_f$, respectively. Clearly, if
l equals the length of the longest path in (J, E), the algorithm
precedence graph of algorithm (J, D) defined in
Definition 2.2, then $t(\sigma_f, J, D) = l + 1$. Let $k(H(p),
D_p) = t(\Pi_p, H(p), D_p) - t(\sigma_f, H(p), D_p)$, where $\Pi_p$
is the optimal linear schedule for algorithm $(H(p), D_p)$
and can be found by Procedure 3.1. Karp et al. [1] have
shown that $k(H(p), D_p)$, where $p \in J$, is bounded above
by a constant. However, as they pointed out, finding
the constant is still an open problem. In [12], 25 algorithms were
studied and it was found that the difference k(J, D) is
equal to either one or zero for those algorithms.

The following theorem provides conditions on the
dependence matrix of a given algorithm which guarantee
that an optimal linear schedule for a given computation
yields the same execution time as the free
schedule.

Theorem 4.1. Without loss of generality, let
$D_p = [d_1, \ldots, d_{m_p}]$. If $\mathrm{rank}([d_1 - d_2, d_1 - d_3, \ldots,
d_1 - d_{m_p}]) < \mathrm{rank}(D_p)$, then $k(H(p), D_p) = 0$.




Proof.  Provided  in  Appendix. 

Theorem 4.1 applies to Example 3.2 where $k(H(p),
D_p) = 0$, $p = [11, 0]^T$. Next, it is shown in two examples
that k(J, D), the difference between execution
times of algorithm (J, D) achieved by optimal linear
schedules and free schedules, depends on the shape and
size of the index set of the algorithm. In other words,
the result of Theorem 4.1 does not generalize to the
case when all computations are considered.
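The rank condition of Theorem 4.1 is easy to check mechanically. The following sketch uses exact rational elimination; the function names `rank` and `satisfies_theorem_4_1` are hypothetical, and matrices are given as lists of rows.

```python
from fractions import Fraction


def rank(M):
    """Rank of a matrix given as a list of rows, by exact Gaussian elimination."""
    A = [[Fraction(x) for x in row] for row in M]
    n_rows = len(A)
    n_cols = len(A[0]) if A else 0
    rk = 0
    for c in range(n_cols):
        piv = next((r for r in range(rk, n_rows) if A[r][c] != 0), None)
        if piv is None:
            continue
        A[rk], A[piv] = A[piv], A[rk]
        for r in range(rk + 1, n_rows):
            f = A[r][c] / A[rk][c]
            A[r] = [a - f * b for a, b in zip(A[r], A[rk])]
        rk += 1
    return rk


def satisfies_theorem_4_1(Dp):
    """True if rank([d1 - d2, ..., d1 - d_mp]) < rank(Dp)."""
    m = len(Dp[0])
    # Column i of H is d1 - d_(i+1); with a single column, H is empty and has rank 0.
    H = [[row[0] - row[i] for i in range(1, m)] for row in Dp]
    return rank(H) < rank(Dp)


print(satisfies_theorem_4_1([[1, 2], [-3, 0]]))        # D_p of Example 3.2
print(satisfies_theorem_4_1([[1, 2, 2], [-3, 4, 0]]))  # full D of Example 2.1
```

For the D_p of Example 3.2 the single difference vector has rank 1, which is less than rank(D_p) = 2, so the condition holds; for the full three-column D of Example 2.1 it does not.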


Example 4.1. Consider algorithm (J, D) where
$J = \{[j_1, j_2]^T : 0 \le j_1, j_2 \le s,\ j_1, j_2 \in Z\}$ and

$$D = \begin{bmatrix} 3 & 0 \\ -3 & 2 \end{bmatrix}.$$

Let $\Pi^0$ be the linear schedule vector by which the total
execution time of the whole algorithm is minimized. By
Procedure 3.1 in [14], $\Pi^0 = [5, 3]$. Figure 3a depicts
the index set where s = 8. The label at each index point
j is equal to $\sigma_f(j)$, and the parallel lines correspond to
the hyperplanes $\sigma_{\Pi^0}(j) = \lfloor ([5, 3]j)/6 \rfloor = c_1$. If an
index point j lies on the line $\sigma_{\Pi^0}(j) = c_1$ or between
lines $\sigma_{\Pi^0}(j) = c_1$ and $\sigma_{\Pi^0}(j) = c_1 + 1$, then the computation
indexed by j is assigned to execute at time step $c_1$
by the optimal linear schedule $\sigma_{\Pi^0}$. It can be verified
that when s = 6, k(J, D) = 0 and when s = 8, k(J,
D) = 10 - 9 = 1. So k(J, D) is a function of the size s.
Actually, regardless of the value of s, k(J, D) is bounded
above by the constant $\lfloor (\Pi^0[2, 0]^T)/6 \rfloor = 1$. The derivation
of this upper bound can be done heuristically as follows.
Clearly, the last computation scheduled to execute by
the optimal linear schedule $\sigma_{\Pi^0}$ is indexed by the index
point $p = [s, s]^T$. Consider $(H(p), D_p)$; because $D_p$
satisfies the condition in Theorem 4.1, $k(H(p), D_p) = 0$.
Also, $\Pi_p = \Pi^0$ due to the fact that $D_p = D$. So, k(J,
D) = $\sigma_{\Pi^0}(u)$ where u is an initial point (u is an initial
point if $\sigma_f(u) = 0$) such that $\sigma_{\Pi^0}(u) = \min\{\sigma_{\Pi^0}(u_i) :
\sigma_f(u_i) = 0,\ u_i \in H(p)\}$. There are six initial points in J.
If the initial point u is not easy to identify, one possible
upper bound for k(J, D) is $\sigma_{\Pi^0}([2, 1]^T) =
\max\{\sigma_{\Pi^0}(u_i) : \sigma_f(u_i) = 0,\ u_i \in J\} \ge \sigma_{\Pi^0}(u)$. As shown
in [17], there is only one initial point connected to p,
and, regardless of the value of s, initial point $[2, 1]^T$
does not belong to H(p) (due to the fact that $\Pi^0[2, 1]^T \not\equiv
\Pi^0 p \pmod 6$). So, the upper bound $\sigma_{\Pi^0}([2, 1]^T)$ is not
tight. It can be verified that the initial point $[2, 0]^T$
belongs to H(p) (i.e., $\Pi^0[2, 0]^T \equiv \Pi^0 p \pmod 6$) when
s is such that 3 divides 4s - 2 (i.e., s = 2, 5, 8, 11
and so on). Therefore, $k(J, D) \le \sigma_{\Pi^0}([2, 0]^T) = 1$ and
when s = 2 + 3c, $c \in Z$, $k(J, D) = \sigma_{\Pi^0}([2, 0]^T) = 1$.
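The initial points and the congruence test used in this example can be checked with a few lines of Python. This is only a sketch: the helper names are hypothetical, and the congruence is used here as the necessary condition quoted in the text (every dependence vector has displacement 6 under [5, 3], so any point connected to p must agree with p modulo 6).

```python
D = [(3, -3), (0, 2)]     # dependence vectors of Example 4.1
PI = (5, 3)               # linear schedule vector quoted from [14]
DISP = 6                  # [5, 3] gives displacement 6 for both vectors


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def initial_points(s):
    """Points of the square 0 <= j1, j2 <= s that have no predecessor."""
    J = {(a, b) for a in range(s + 1) for b in range(s + 1)}
    return sorted(j for j in J
                  if all((j[0] - d[0], j[1] - d[1]) not in J for d in D))


def congruent_initial_points(s):
    """Initial points u with PI.u congruent to PI.p modulo DISP, where p = (s, s)."""
    p = (s, s)
    return [u for u in initial_points(s)
            if (dot(PI, u) - dot(PI, p)) % DISP == 0]


for s in (6, 8):
    u = congruent_initial_points(s)
    print(s, initial_points(s), u, [dot(PI, x) // DISP for x in u])
```

For s = 8 the only initial point that passes the test is (2, 0), whose linear-schedule time is 1; for s = 6 it is (0, 0), with time 0, in line with the values of k(J, D) stated in the example.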


Fig. 3. The index sets of the algorithms of Example 4.1 (a) and Example
4.2 (b). The label at each point equals the time step of execution
assigned by the free schedule. The parallel lines indicate the time
step of execution assigned by the optimal linear schedule. All points
lying on the line $\sigma_{\Pi^0}(j) = c$ or between lines $\sigma_{\Pi^0}(j) = c$ and
$\sigma_{\Pi^0}(j) = c + 1$ are executed at time step c.

Example 4.2. Consider the algorithm that has the
same dependence matrix as Example 4.1 and $J =
\{j : Aj \le b,\ j \in Z^2\}$ where

$$A = \begin{bmatrix} -1 & 1 \\ 0 & -1 \\ 1 & 0 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} 0 \\ 0 \\ s \end{bmatrix}.$$

The index set is shown in Figure 3b (for s = 8) and
has a shape different from that of Example 4.1. By Procedure
3.1 in [14], $\Pi^0 = [5, 3]$. The linear schedule
specified by $\Pi^0$ is $\sigma_{\Pi^0}(j) = \lfloor ([5, 3]j)/6 \rfloor$. When s = 6,
k(J, D) = 3 and when s = 8, k(J, D) = 10 - 6 = 4,
i.e., for the same sizes considered in Example 4.1 k(J,
D) has different values due to the different index set
shape. Regardless of the value of s, k(J, D) is bounded
above by $\lfloor (\Pi^0[5, 1]^T)/6 \rfloor = 4$. The derivation of this
upper bound is similar to the one presented in Example
4.1.
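The dependence of k(J, D) on the shape of the index set can be reproduced numerically. The sketch below is illustrative and makes its assumptions explicit: the helper names are hypothetical, the square index set is the one of Example 4.1, the triangular one is generated from the matrix A and vector b given above, and the linear schedule vector [5, 3] is taken from the text rather than recomputed.

```python
from functools import lru_cache

D = [(3, -3), (0, 2)]
PI, DISP = (5, 3), 6


def square(s):
    return {(a, b) for a in range(s + 1) for b in range(s + 1)}


def triangle(s):
    A = [(-1, 1), (0, -1), (1, 0)]
    b = (0, 0, s)
    return {(j1, j2) for j1 in range(s + 1) for j2 in range(s + 1)
            if all(a1 * j1 + a2 * j2 <= bi for (a1, a2), bi in zip(A, b))}


def k_value(J):
    """t(linear) - t(free) for the index set J under the schedule vector PI."""
    J = frozenset(J)

    @lru_cache(maxsize=None)
    def sigma_f(j):
        times = [sigma_f((j[0] - d[0], j[1] - d[1])) for d in D
                 if (j[0] - d[0], j[1] - d[1]) in J]
        return max(times) + 1 if times else 0

    t_free = max(sigma_f(j) for j in J) + 1
    values = [PI[0] * j[0] + PI[1] * j[1] for j in J]
    # c = -min(values), so the schedule runs from step 0 to floor((max - min) / DISP).
    t_linear = (max(values) - min(values)) // DISP + 1
    return t_linear - t_free


for s in (6, 8):
    print(s, k_value(square(s)), k_value(triangle(s)))
```

The same dependence matrix and schedule vector give differences of 0 and 1 on the square and 3 and 4 on the triangle, for s = 6 and s = 8 respectively.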


5.  Concluding  Remarks 

In summary, this paper provides a solution to the problem
of identifying an optimal linear schedule for the
execution of one specific computation of algorithms
with uniform dependences. Also, a class of algorithms
for which the optimal linear schedule is as efficient as
the best possible schedule (i.e., it is as fast as the free
schedule) is identified.


6.  Appendix 

Proof of Theorem 3.1. First, u is an initial point if and
only if $\sigma_f(u) = 0$. Let $u_1, \ldots, u_s$ be all the initial
points in H(p). Because the computation indexed by
p is the last to execute and the computation indexed by
one of the initial points in H(p) is the first one to execute,
Equation 2.4 can be rewritten as

$$\min f_p = \frac{\max\{\Pi(p - u_i) : i = 1, \ldots, s\}}{\min\{\Pi d_i : d_i \in D_p\}} \quad \text{subject to } \Pi D_p > \bar{0}. \qquad (6.1)$$

Let's consider the linear schedule vector $\Pi$ defined
in Definition 2.5, i.e., the linear schedule vector $\Pi$ such
that $\min\{\Pi d_i : d_i \in D_p\} = 1$. Then, Equation 6.1 is
equivalent to

$$\min f_p = \max\{\Pi(p - u_i) : i = 1, \ldots, s\} \quad \text{subject to } \Pi D_p \ge \bar{1}. \qquad (6.2)$$

Furthermore, let $v_i = p - u_i$, $i = 1, \ldots, s$. Using
the approach described in ([18], pp. 96), Equation 6.2
can be reformulated as a linear programming problem:

$$\min f = \pi_{n+1} \quad \text{subject to } \Pi D_p \ge \bar{1}, \quad \Pi[v_1, \ldots, v_s] \le \pi_{n+1}\bar{1} \qquad (6.3)$$

or equivalently

$$\max\, -f = [\pi_{n+1}, \pi_1, \ldots, \pi_n]\begin{bmatrix} -1 \\ \bar{0} \end{bmatrix} \quad \text{subject to } [\pi_{n+1}, \pi_1, \ldots, \pi_n]\begin{bmatrix} \bar{0} & -1 \cdots -1 \\ -D_p & v_1 \cdots v_s \end{bmatrix} \le [-\bar{1}, \bar{0}]. \qquad (6.4)$$

Clearly, if $[\pi_{n+1}, \pi_1, \ldots, \pi_n]$ is an optimal solution
to linear programming problem 6.4, then $[\pi_1, \ldots,
\pi_n]$ is an optimal solution to (6.2), or Problem 2.1.
The dual problem of Equation 6.4 is ([19], pp. 86):

$$\min\, [-\bar{1}, \bar{0}]\,x \quad \text{subject to } \begin{bmatrix} \bar{0} & -1 \cdots -1 \\ -D_p & v_1 \cdots v_s \end{bmatrix} x = \begin{bmatrix} -1 \\ \bar{0} \end{bmatrix}, \quad x \ge \bar{0} \qquad (6.5)$$

where $x \in \mathbb{R}^{(m_p + s) \times 1}$. Now, Problem 2.1 has been formulated
as linear programming problem 6.4. Next, the
solution of linear programming problem 6.5 is considered
first. The solution of linear programming problem
6.4 can be constructed easily from the solution of (6.5).

Let's assume that $m'_p = \mathrm{rank}(D_p) = n$. Let $t(\Pi,
H(p), D_p)$ denote the total execution time achieved by
$\sigma_\Pi$ for $(H(p), D_p)$, i.e., $t(\Pi, H(p), D_p)$ is expressed
by Equation 2.3. Consider the linear schedule vector
$\Pi'' \in C^p$ such that $t(\Pi'', H(p), D_p) = \min\{t(\Pi, H(p),
D_p) : \Pi \in C^p\}$. Without loss of generality, let $\Pi'' =
\Pi(1 \ldots n / 1 \ldots n)$. Let $G = [d_1, \ldots, d_n]$, then $\Pi'' =
\beta \bar{1} G^{-1}$. Clearly, if it can be shown that $\Pi^* = (1/\beta)\Pi''
= \bar{1}G^{-1}$ is an optimal solution of linear programming
problem 6.2, we are done. Now let's assume that $u_1
= p - v_1$ is an initial point to be executed first by $\Pi^*$,
i.e., $\Pi^* u_1 = \min\{\Pi^* j : j \in H(p)\}$ and consider the following
linear programming problem

$$\min \Pi v_1 \quad \text{subject to } \Pi D_p \ge \bar{1} \qquad (6.6)$$

and its dual

$$\max \bar{1} y \quad \text{subject to } D_p y = v_1, \quad y \ge \bar{0} \qquad (6.7)$$

where $y \in \mathbb{R}^{m_p \times 1}$. By the definition of $\Pi^*$ it is an optimal
solution of problem 6.6, i.e., G is an optimal basis
for problem 6.6. Let $D_p = [G, H]$, $C_G = [-1, \ldots,
-1]_{1 \times n}$ and $C_H = [-1, \ldots, -1]_{1 \times (m_p - n)}$. By the simplex
procedure ([19], pp. 58) the initial tableau for problem
6.7 takes the form

$$\begin{bmatrix} G & H & v_1 \\ C_G & C_H & 0 \end{bmatrix}.$$

If matrix G is used as a basis, then the corresponding
tableau becomes

$$\begin{bmatrix} I & G^{-1}H & G^{-1}v_1 \\ \bar{0} & C_H - C_G G^{-1}H & -C_G G^{-1}v_1 \end{bmatrix}$$

where I is the identity matrix. Because $\Pi^*$ is an optimal
solution, by the Optimality Condition Theorem ([19],
pp. 43), $G^{-1}v_1 \ge \bar{0}$ and $C_H - C_G G^{-1}H \ge \bar{0}$.

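For linear programming problem 6.5, the initial
tableau takes the form

$$\begin{bmatrix} \bar{0} & -1 \cdots -1 & -1 \\ -D_p & v_1 \cdots v_s & \bar{0} \\ -\bar{1} & \bar{0} & 0 \end{bmatrix}.$$

Let

$$G_0 = \begin{bmatrix} \bar{0} & -1 \\ -G & v_1 \end{bmatrix}, \quad K = \begin{bmatrix} \bar{0} & -1 \cdots -1 \\ -H & v_2 \cdots v_s \end{bmatrix},$$

$C_{G_0} = [C_G, 0]$, $w = [-1, \bar{0}]^T_{(n+1) \times 1}$ and
$C_K = [C_H, \bar{0}]_{1 \times (m_p + s - n - 1)}$.

If $G_0$ is used as a basis for problem 6.5, the corresponding
tableau for problem 6.5 is

$$\begin{bmatrix} I & G_0^{-1}K & G_0^{-1}w \\ \bar{0} & C_K - C_{G_0}G_0^{-1}K & -C_{G_0}G_0^{-1}w \end{bmatrix}. \qquad (6.8)$$

After some manipulation it can be seen that

$$G_0^{-1} = \begin{bmatrix} -G^{-1}v_1 & -G^{-1} \\ -1 & \bar{0} \end{bmatrix},$$

$$C_{G_0}G_0^{-1}K = -[\Pi^* v_1, \Pi^*]\begin{bmatrix} \bar{0} & 1 \cdots 1 \\ H & -v_2 \cdots -v_s \end{bmatrix} = [C_G G^{-1}H,\ \Pi^*(v_2 - v_1), \ldots, \Pi^*(v_s - v_1)]$$

and

$$C_K - C_{G_0}G_0^{-1}K = [C_H - C_G G^{-1}H,\ \Pi^*(v_1 - v_2), \ldots, \Pi^*(v_1 - v_s)].$$

$\Pi^* u_1 = \min\{\Pi^* u_i : i = 1, \ldots, s\}$ implies that
$\Pi^*(v_1 - v_i) = \Pi^*(u_i - u_1) \ge 0$, $i = 1, \ldots, s$. Therefore,
$C_K - C_{G_0}G_0^{-1}K \ge \bar{0}$. For $G_0^{-1}w$, because
$G^{-1}v_1 \ge \bar{0}$, it follows that

$$G_0^{-1}w = \begin{bmatrix} -G^{-1}v_1 & -G^{-1} \\ -1 & \bar{0} \end{bmatrix}\begin{bmatrix} -1 \\ \bar{0} \end{bmatrix} = \begin{bmatrix} G^{-1}v_1 \\ 1 \end{bmatrix} \ge \bar{0}.$$

Now it has been shown that $C_K - C_{G_0}G_0^{-1}K \ge \bar{0}$
and $G_0^{-1}w \ge \bar{0}$ for tableau 6.8. Again by the Optimality
Conditions Theorem ([19], pp. 43), $G_0$ is an optimal
basis for problem 6.5.

According to the theorem ([19], pp. 91), if $G_0$ is an
optimal basis for problem 6.5, then $\Pi' = C_{G_0}G_0^{-1} =
-C_G[G^{-1}v_1, G^{-1}] = [\Pi^* v_1, \Pi^*]$ is an optimal solution
for problem 6.4. Therefore, the optimal solution
of problem 6.2 is $\Pi_p = \Pi^*$ and the optimal execution
time is $\pi_{n+1} = \Pi^* v_1$. So the optimal solution of Problem
2.1 for $(H(p), D_p)$ is $\Pi(1 \ldots n / 1 \ldots n)$ which
belongs to $C^p$, i.e., $\Pi_p \in C^p$.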

Now let's discuss the case where $\mathrm{rank}(D_p) =
m'_p < n$. H(p) can be transformed into an $m'_p$-dimensional
space by a bijective mapping. Without loss of
generality, let $d_1, \ldots, d_{m'_p}$ be linearly independent
and $D'_p = [d_1, \ldots, d_{m'_p}]$. Three facts are proven
next. First, let $T \in Z^{m'_p \times n}$ and TH(p) denote the set $\{Tj :
j \in H(p)\}$. Then the mapping $\tau: H(p) \to TH(p)$, $\tau(j)
= Tj$, $j \in H(p)$, is bijective if and only if $\mathrm{rank}(TD'_p) =
m'_p$. Because $\mathrm{rank}(D_p) = m'_p$, $d_i$ can be written as a
linear combination of $d_1, \ldots, d_{m'_p}$, $i = m'_p + 1, \ldots,
m_p$, i.e., there exists a matrix $\Gamma \in \mathbb{R}^{m'_p \times (m_p - m'_p)}$ such that
$D_p = D'_p[I, \Gamma]$ where I is the identity matrix. If
$j_1, j_2 \in H(p)$, then $j_1 = j_2 + D_p\lambda$ where $\lambda \in Z^{m_p}$. So,
$T(j_1 - j_2) = TD_p\lambda = TD'_p[I, \Gamma]\lambda = TD'_p a$ where
$a = [I, \Gamma]\lambda$. If $\mathrm{rank}(TD'_p) = m'_p$, then the equation
$TD'_p a = \bar{0}$ has no nontrivial solution, which means
that $T(j_1 - j_2) = \bar{0}$ if and only if $j_1 - j_2 = \bar{0}$. Therefore,
$\tau$ is bijective.

Secondly, there always exists a matrix T such that
$\mathrm{rank}(TD'_p) = m'_p$. Because $\mathrm{rank}(D'_p) = m'_p$, there
exists a set of $m'_p$ linearly independent rows of $D'_p$,
i.e., there exists a nonsingular submatrix $D(1 \ldots m'_p /
r_1 \ldots r_{m'_p})$, $1 \le r_1 < \cdots < r_{m'_p} \le n$. Let

$$T = \begin{bmatrix} E_{r_1} \\ \vdots \\ E_{r_{m'_p}} \end{bmatrix},$$

then $TD'_p = D(1 \ldots m'_p / r_1 \ldots r_{m'_p})$ and $\mathrm{rank}(TD'_p) = m'_p$.

Thirdly, $\Lambda T = \Pi(c_1 \ldots c_{m'_p} / r_1 \ldots r_{m'_p})$. Consider the
algorithm $(TH(p), TD_p)$ where $\mathrm{rank}(TD_p) = m'_p$. If the
same reasoning as in the case where $m'_p = n$ is applied,
then $\Lambda = \bar{1}D^{-1}(c_1 \ldots c_{m'_p} / r_1 \ldots r_{m'_p})$ is the optimal solution
of Problem 2.1 for $(TH(p), TD_p)$. If $\Lambda$ is an optimal
solution for $(TH(p), TD_p)$, then $\Lambda T$ is an optimal
solution of Problem 2.1 for $(H(p), D_p)$. This is true
because

$$\max\{\Lambda(Tp - Tu_i) : i = 1, \ldots, s\} = \max\{\Lambda T(p - u_i) : i = 1, \ldots, s\}. \qquad (6.9)$$

The left-hand side of Equation 6.9 is the optimal
value of the objective function in Equation 6.2 achieved
by $\Lambda$ for $(TH(p), TD_p)$. So the right-hand side of
Equation 6.9 should be optimal for $(H(p), D_p)$ and $\Lambda T$
should be an optimal linear schedule vector for $(H(p),
D_p)$. Therefore, the optimal solution of Problem 2.1
for $(H(p), D_p)$ is $\Pi_p = \Lambda T = \Pi(c_1 \ldots c_{m'_p} / r_1 \ldots r_{m'_p})$
which belongs to $C^p$, i.e., $\Pi_p \in C^p$.


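Proof of Theorem 4.1. First, it is shown that there exists
a linear schedule vector $\Pi$ such that $\Pi d_1 = \cdots
= \Pi d_{m_p} \ne 0$. Let $H = [d_1 - d_2, d_1 - d_3, \ldots, d_1 - d_{m_p}]$ and
consider the following two equations:

$$\text{(a)}\ \ \Pi H = \bar{0} \quad \text{and} \quad \text{(b)}\ \ \Pi D_p = \bar{0}. \qquad (6.10)$$

Let $\Omega_1$ and $\Omega_2$ denote the solution space of Equation
6.10a and the solution space of Equation 6.10b, respectively.
Clearly, if $\Pi \in \Omega_2$, then $\Pi \in \Omega_1$, so $\Omega_2 \subseteq \Omega_1$. Let
$\dim\{\Omega_i\}$, $i = 1, 2$, denote the number of linearly
independent vectors in $\Omega_i$. Because $\mathrm{rank}(H) <
\mathrm{rank}(D_p)$, $\dim\{\Omega_1\} = n - \mathrm{rank}(H) > n - \mathrm{rank}(D_p)
= \dim\{\Omega_2\}$. This implies that $\Omega_2 \subset \Omega_1$. So there exists
at least one vector $\Pi$ belonging to $\Omega_1$ but not belonging
to $\Omega_2$, i.e., $\Pi d_1 = \cdots = \Pi d_{m_p} \ne 0$.

The optimal linear schedule vector $\Pi_p$ for $(H(p), D_p)$
can be selected as follows. If $\Pi d_1 = \cdots = \Pi d_{m_p} < 0$,
then $\Pi_p = -\Pi$, otherwise $\Pi_p = \Pi$. Now,

$$t(\Pi_p, H(p), D_p) = \left\lfloor \frac{\Pi_p p - \min\{\Pi_p j : j \in H(p)\}}{\mathrm{disp}\,\Pi_p} \right\rfloor + 1.$$

Let g be such that $\Pi_p g = \min\{\Pi_p j : j \in H(p)\}$.
Because $g \in H(p)$, there exists a vector $\lambda \in Z^{m_p}$ such that
$p = D_p\lambda + g$. The length of the path connecting g to
p is equal to $l = \sum_{i=1}^{m_p} \lambda_i$ and

$$\Pi_p p - \Pi_p g = \Pi_p D_p \lambda = \mathrm{disp}\,\Pi_p \sum_{i=1}^{m_p} \lambda_i = l \cdot \mathrm{disp}\,\Pi_p.$$

Therefore,

$$t(\Pi_p, H(p), D_p) = \left\lfloor \frac{\Pi_p p - \Pi_p g}{\mathrm{disp}\,\Pi_p} \right\rfloor + 1 = \left\lfloor \frac{l \cdot \mathrm{disp}\,\Pi_p}{\mathrm{disp}\,\Pi_p} \right\rfloor + 1 = l + 1.$$

Now, $t(\sigma_f, H(p), D_p) \le t(\Pi_p, H(p), D_p) = l + 1$.
However, $t(\sigma_f, H(p), D_p) \ge l + 1$ because there exists
a path of length l. So $t(\sigma_f, H(p), D_p) = l + 1
= t(\Pi_p, H(p), D_p)$, i.e., $k(H(p), D_p) = 0$.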


List  of  Symbols 

$C^p$: candidate set for $(H(p), D_p)$; see Theorem 3.1.

D: dependence matrix with n rows and m columns;
see Definition 2.1 (4).

$D_p$: matrix of dependencies that are present in
H(p); see the paragraph before Problem 2.1.

$d_i$: dependence (column) vector with n components;
see Definition 2.1 (4).

E: the set of edges of the algorithm precedence
graph; see Definition 2.2.

$E_i$: row vector with n components whose entries
are zero except that the ith entry is 1.

$f_p$: objective function of Problem 2.1.

H(p): set of index point p and all its predecessors;
see the paragraph before Problem 2.1.

I: identity matrix.

$\mathbb{R}$: set of real numbers.

J: index set; see Definition 2.1 (1).

j: column vector; index point; see Definition 2.1 (1).

k(J, D): the difference between the total execution
times of (J, D) achieved by the optimal
linear schedule and the free schedule,
respectively; see second paragraph of Section 4.

m: number of dependence vectors in D; see
Definition 2.1 (4).

$m_p$: number of dependence vectors in $D_p$.

$m'_p$: rank of matrix $D_p$.

N: set of non-negative integers.

$N^+$: set of positive integers.

n: number of components of index points in
J; see Definition 2.1 (1).

Q: set of rational numbers.

rank(A): rank of matrix A.

$t(\Pi, J, D)$: total execution time of (J, D) by $\sigma_\Pi$; see
second paragraph of Section 4.

$t(\sigma_f, J, D)$: total execution time of (J, D) by $\sigma_f$; see
second paragraph of Section 4.

p: a specific index point in J; see Problem 2.1.

Z: set of integers.

$\Pi$: row vector; linear schedule vector; see
Definition 2.6.

$\Pi_p$: row vector; optimal solution of Problem 2.1.

$\sigma$: a schedule (function); see Definition 2.3.

$\sigma_f$: free schedule; see Definition 2.4.

$\sigma_\Pi$: linear schedule; see Definition 2.6.

$\emptyset$: empty set.

$\bar{1}$: a column or row vector whose entries are
all 1.

$\bar{0}$: a column or row vector whose entries are
all 0.

References 

1. R.M. Karp, R.E. Miller, and S. Winograd. The organization of computations for uniform recurrence equations. JACM 14, 3, Jul. 1967, pp. 563-590.

2. D.I. Moldovan and J.A.B. Fortes. Partitioning and mapping algorithms into fixed size systolic arrays. IEEE Trans. Computers, Vol. C-35, No. 1, Jan. 1986, pp. 1-12.

3. P.R. Cappello and K. Steiglitz. Unifying VLSI array designs with geometric transformations. Proc. Int'l Conf. on Parallel Processing, 1983, pp. 148-157.

4. P. Quinton. Automatic synthesis of systolic arrays from uniform recurrent equations. Proc. 11th Annual Symposium on Computer Architecture, 1984, pp. 208-214.

5. S.K. Rao. Regular iterative algorithms and their implementations on processor arrays. Ph.D. Dissertation, Stanford University, Stanford, California, Oct. 1985.

6. M. Chen. A design methodology for synthesizing parallel algorithms and architectures. Journal of Parallel and Distributed Computing, Dec. 1984, pp. 461-491.

7. J.-M. Delosme and I.C.F. Ipsen. An illustration of a methodology for the construction of efficient systolic architectures in VLSI. Proc. Second Int'l Symposium on VLSI Technology, Systems and Applications, 1985, pp. 268-273.

8. S.Y. Kung. VLSI Array Processors. Englewood Cliffs, N.J.: Prentice-Hall, 1987.

9. C. Guerra and R. Melhem. Synthesizing non-uniform systolic designs. Proc. Int'l Conf. on Parallel Processing, 1986, pp. 765-771.

10. G.-J. Li and B.W. Wah. The design of optimal systolic arrays. IEEE Trans. Computers, Vol. C-34, Jan. 1985, pp. 66-77.

11. M.T. O'Keefe and J.A.B. Fortes. A comparative study of two systematic design methodologies for systolic arrays. Proc. Int'l Conf. on Parallel Processing, 1986, pp. 672-675.

12. J.A.B. Fortes and F. Parisi-Presicce. Optimal linear schedule for the parallel execution of algorithms. Proc. Int'l Conf. on Parallel Processing, 1984, pp. 322-328.

13. R. Cytron. Doacross: Beyond vectorization for multiprocessors (extended abstract). Proc. Int'l Conf. on Parallel Processing, 1986, pp. 836-844.

14. W. Shang and J.A.B. Fortes. Time optimal linear schedules for algorithms with uniform dependencies. Proc. Int'l Conf. on Systolic Arrays, May 1988, pp. 393-402.

15. D.A. Padua. Multiprocessors: Discussion of theoretical and practical problems. Ph.D. Thesis, Univ. of Illinois at Urb.-Champ., Rept. No. UIUCDCS-R-79-990, Nov. 1979.

16. J.-K. Peir and R. Cytron. Minimum distance: a method for partitioning recurrences for multiprocessors. Proc. Int'l Conf. on Parallel Processing, 1987, pp. 217-225.

17. W. Shang and J.A.B. Fortes. Independent partitioning of algorithms with uniform dependencies. Proc. Int'l Conf. on Parallel Processing, Vol. 2, 1988, pp. 26-33.

18. P.E. Gill, W. Murray, and M.H. Wright. Practical Optimization. New York: Academic Press, 1981.

19. D.G. Luenberger. Linear and Nonlinear Programming. Second Edition. Menlo Park, California: Addison-Wesley Publishing Company, 1984.


220  Shang  and  Fortes 


Weijia Shang received the B.S. degree in computer engineering and science from Changsha Institute of Technology in 1982 and the M.S. degree in electrical engineering from Purdue University in 1984.

She is currently a Ph.D. student in the School of Electrical Engineering, Purdue University, West Lafayette, IN. Her research interests include parallel processing, computer architecture, scheduling problems, and systolic arrays.

Ms. Shang is a student member of IEEE and ACM.


Jose A.B. Fortes has been with the faculty of Purdue University's School of Electrical Engineering since 1984.

His current interests in parallel processing include the systematic design of algorithmically specialized processor-array architectures, automatic parallelism detection and exploitation techniques, and fault-tolerant computing.

Fortes has published over 40 technical papers in journals and conference proceedings in the areas of parallel processing, fault-tolerant computing, and VLSI architectures. He has worked on several projects in these areas in cooperation with or with funding from NSF, ONR, AT&T, GE, RCA, and NCR.

Fortes is a member of IEEE.

He received his M.S.E.E. and Ph.D. E.E. degrees from Colorado State University and the University of Southern California in 1981 and 1983, respectively.


REFERENCE  NO.  2 


Chean,  M.,  and  Fortes,  J.  A.  B.,  "A  Taxonomy  of  Reconfiguration  Techniques  for 
Fault-Tolerant  Processor  Arrays,"  IEEE  Computer,  December  1990,  pp.  55-69. 

Note- This paper surveys and classifies processor array reconfiguration schemes proposed by the authors and other researchers. Criteria for evaluation of the schemes and their relative merits are also discussed.


SURVEY  &  TUTORIAL  SERIES 


A  Taxonomy  of 
Reconfiguration  Techniques 
for  Fault-Tolerant 
Processor  Arrays 


Mengly Chean* and Jose A.B. Fortes
Purdue  University 


Very large scale integration/wafer-scale integration (VLSI/WSI) technology is most advantageous when used to implement regularly structured systems such as large arrays of identical processing elements. As integration levels increase and the sizes of arrays grow larger, the possibility of single or multiple faults occurring in a VLSI/WSI array heightens. These faults can occur during the operational lifetime of the array, as well as during its manufacturing process. In a non-fault-tolerant array structure, the failure of a single element can cause the array performance to degrade severely if not fatally.

On the other hand, the array might be able to operate in a fault-tolerant reconfigurable structure, even if a certain number of faults are present. This can be done by using a reconfiguration technique for restructuring the array statically (at fabrication time, for yield enhancement) or dynamically (during its operational lifetime, for improved reliability).

In recent years, a large number of reconfiguration schemes have been proposed. Before deciding which techniques are best suited for particular designs, it is important to understand their similarities and differences. This article introduces a taxonomy for reconfiguration techniques for fault-tolerant processor arrays (FTPAs) and discusses their distinguishing characteristics.


This proposed taxonomy focuses on FTPA reconfiguration techniques useful for comparing many possible schemes. It can help engineers and researchers understand existing approaches and develop new ones.


The article focuses on the characterization and classification of a specific feature of FTPAs, that is, reconfiguration, as opposed to Abraham et al.,1 which surveys and discusses different features of FTPAs. Schemes can be differentiated from one another according to the type of redundancy (time or hardware), allocation of redundancy (local or global), replacement unit (processor or a set of processors), switching domain (global or local), and switching implementation (switching element, bus, or network). These characteristics can be used as the basis for a taxonomy of reconfiguration techniques as illustrated by the classification trees shown in Figures 1 and 2.


* Chean is now with the Bellaire Research Center, Shell Development Company, Houston, Texas.

A detailed discussion of every tree leaf as a separate and distinct class is neither practical nor possible in this article. Instead, the techniques in four major subtrees are grouped into larger classes that correspond to the following distinguishing characteristics:

• local redundancy in hardware redundant schemes,

• set switching in schemes with global hardware redundancy,

• processor switching in schemes with global hardware redundancy, and

• time redundancy.


January  1990 




Figure 1. Taxonomy for hardware-redundancy reconfiguration schemes. (Asterisks identify the roots of the subtrees and their corresponding classes.)




COMPUTER 


Figure  2.  Taxonomy  for  time-redundancy  reconfiguration  schemes.  (Asterisk 
identifies  the  root  of  the  subtree  and  its  corresponding  class.) 


In Figures 1 and 2, asterisks identify the roots of the subtrees to which these classes correspond. For brevity, these classes are denoted as the local-redundancy class, the processor-switching class, the set-switching class, and the time-redundancy class. Schemes in the local-redundancy class attempt to minimize the maximum wire length; this can result in some waste of functional cells. In processor-switching schemes, efficient cell use can be obtained at the cost of reconfiguration time and hardware. Reconfiguration algorithms in the set-switching class are less complex than local-redundancy approaches but can waste more functional cells. Finally, when a decrease in computational speed is an acceptable form of performance degradation, time-redundancy techniques can be considered and potentially have less silicon area overhead than those that use hardware redundancy.

This article is not designed to exhaustively survey previously proposed reconfiguration schemes. Instead, we review typical techniques selected from each class in the following sections to discuss and illustrate the main characteristics of different classes of schemes.

Terminology 

The reconfiguration techniques presented in this article apply to an (M+R) × (N+C) array of cells in which M and N represent the number of rows and columns, respectively, and R and C represent the number of spare rows and spare columns, respectively. Figure 3b shows an array with one spare row and one spare column. Cells are denoted by squares, and crossed cells represent faulty cells. We will introduce additional circuitry used to bypass faulty cells when we discuss appropriate reconfiguration techniques.


Figure 3. Mapping of a logical array into a physical array: (a) logical FTPA; (b) physical FTPA.

A fault-tolerant processor array (see the sidebar, "The ubiquitous processor array," for background on processor arrays) is a two-dimensional array of identical and regularly interconnected processing elements (PEs) incorporating redundant circuitry (spares) and hardware for reconfiguration. A cell is a processing element of an FTPA. An available cell is a good, functional cell (also called fault-free or live cell). An unavailable cell refers to either a faulty cell or a fault-free cell that is already used to replace another cell.

A group of cells in a row, column, or block is called a set. A functional set contains cells that are all functional. In contrast, a set that contains one or more faulty cells is faulty. For example, a row that contains one or more faulty cells is a faulty row, and an array that contains one or more faulty cells is a faulty array.

A physical FTPA is a nonreconfigured FTPA that can contain faulty cells. A cell in a physical FTPA is denoted by (i, j), where i and j represent the cell coordinates, 1 ≤ i ≤ M+R, 1 ≤ j ≤ N+C. A logical FTPA is a reconfigured FTPA that is fully operational. A cell in a logical FTPA is denoted by [i, j], where i and j represent its logical coordinates, 1 ≤ i ≤ M, 1 ≤ j ≤ N.

An array reconfiguration is a mapping θ of logical cells in a logical FTPA into functional cells in a physical FTPA (see Figure 3). Thus, θ([i, j]) = (k, l) if logical cell [i, j] corresponds to physical cell (k, l). The inverse of θ maps a functional physical cell to a logical cell, for example, θ^-1((i, j)) = [m, n]. Also, θ_r and θ_c map a logical cell to the corresponding physical row and the corresponding physical column, respectively; θ_r^-1 and θ_c^-1 are defined similarly. For example, θ_r([m, n]) = k if logical cell [m, n] is on physical row k. If a mapping between all cells in a logical FTPA and cells in a physical FTPA exists, the reconfiguration is said to be successful.
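The definition of a successful reconfiguration can be illustrated with a small check. The sketch below is not from the article: the function name `is_successful`, the dictionary representation of the mapping, and the requirement that no physical cell be used twice are assumptions made for illustration, and interconnect feasibility is ignored.

```python
from itertools import product


def is_successful(mapping, M, N, R, C, faulty):
    """Check a candidate reconfiguration (hypothetical helper).

    mapping: dict from logical cell (i, j) to physical cell (k, l), 1-based.
    faulty:  set of faulty physical cells (k, l).
    Returns True if every logical cell of the M x N logical array is mapped
    to a distinct, in-bounds, fault-free cell of the (M+R) x (N+C) array.
    """
    logical_cells = set(product(range(1, M + 1), range(1, N + 1)))
    if set(mapping) != logical_cells:
        return False  # some logical cell is unmapped, or an extra key exists
    targets = list(mapping.values())
    if len(set(targets)) != len(targets):
        return False  # assumption: a physical cell hosts at most one logical cell
    for k, l in targets:
        in_bounds = 1 <= k <= M + R and 1 <= l <= N + C
        if not in_bounds or (k, l) in faulty:
            return False
    return True


# A 2 x 3 logical array on a 3 x 3 physical array whose middle row is faulty.
faulty_cells = {(2, 2)}
skip_middle_row = {(i, j): (i if i < 2 else i + 1, j)
                   for i in (1, 2) for j in (1, 2, 3)}
identity = {(i, j): (i, j) for i in (1, 2) for j in (1, 2, 3)}
```

Here `skip_middle_row` passes the check, whereas `identity` does not, because it places logical cell [2, 2] on the faulty physical cell (2, 2).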


The probability of survival σ(x) (also called survivability) is the conditional probability of success in reconfiguring an array, given that it has x faulty cells. In some instances, reconfiguration is aimed at constructing an array with the largest possible number of functional cells; the target (logical) array size does not need to be fixed. In this case, the term harvest refers to the number of fault-free cells used to construct the largest logical array possible. Ideally, the harvest is the number of cells in the physical array minus the number of faulty cells. In an FTPA, if S is the total number of spare cells and F the total number of faulty cells, then the spare demand ρ is defined as ρ = F/S.
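As a worked example of these quantities, the following sketch computes the ideal harvest and the spare demand for a (7+2) × (9+3) array. The function names are hypothetical, the fault count of 9 is invented for the example, and taking the total number of spares to be the number of physical cells minus the number of logical cells is an assumption rather than a definition from the article.

```python
def total_spares(M, N, R, C):
    """Spare cells S, assumed to be physical cells minus logical cells."""
    return (M + R) * (N + C) - M * N


def ideal_harvest(M, N, R, C, num_faulty):
    """Ideal harvest: cells in the physical array minus faulty cells."""
    return (M + R) * (N + C) - num_faulty


def spare_demand(num_faulty, num_spares):
    """Spare demand rho = F / S."""
    return num_faulty / num_spares


S = total_spares(7, 9, 2, 3)           # 9 * 12 - 7 * 9 spare cells
F = 9                                  # hypothetical number of faulty cells
rho = spare_demand(F, S)
harvest = ideal_harvest(7, 9, 2, 3, F)
```

A spare demand well below 1 means that spares outnumber faults, although whether a given scheme can actually use those spares depends on where the faults fall.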

Four figures of merit, simplicity, efficiency, area, and locality (abbreviated SEAL), are useful to evaluate a reconfiguration scheme.

Simplicity refers to the time complexity of a reconfiguration algorithm as a function of the number of processors in an FTPA. A simple algorithm requires a short execution time that should be polynomially bounded.

Efficiency refers to spare use. An efficient scheme wastes none or very few spare cells and, thus, achieves a very high array survivability and harvest.

Area refers to the overhead of the added interconnect and reconfiguration circuitry. Low-overhead schemes are desirable because a large silicon area increases the probability of having more defective elements.

Locality means that physical interconnections between logically adjacent cells in a reconfigured array should have minimal lengths. It determines the maximum delay in signal propagation, therefore limiting the clock rate at which the array can operate.


The ubiquitous processor array


In its basic form, a processor array is simply a set of processors organized as a grid where each point corresponds to a processing element and lines between points correspond to communication links between processors. This mesh-like organization has been proposed since the early days of digital computers and has been rediscovered for many different applications and technologies. An account of the history of early processor array designs can be found in Uhr1 and Preston and Duff.2

There are many variants of processor arrays. For example, processing elements can be as complex as a modern microprocessor or as simple as an arithmetic logic unit. In fact, memory arrays can be considered an extremely simple example of processor arrays, particularly in the case of associative memories. There are also many variations as to how instructions are executed (centralized versus distributed control) and how memory and input/output are organized. In addition, the basic mesh topology is often augmented with routing mechanisms to support arbitrary communication patterns. The references listed here provide many examples of these variants. Recently, a particular type of architecture, denoted systolic array, has been used in a large number of proposals for special-purpose architectures. A sample of this type of work can be found in Computer3 and Mead and Conway.4

Processor arrays are pervasive parallel architectures for two main reasons: (1) they suit digital electronic technologies (particularly VLSI/WSI), and (2) they can be structurally well-matched to many applications. The implementation advantages result from modularity and local connectivity. Modularity means that it suffices to design one module (a processing element and associated circuitry) that can then be replicated to build arrays of arbitrary size (up to a maximum determined by factors such as reliability, power, and clock distribution). Local connectivity means that each processor is connected only to neighboring processors, thus causing little need for connection wires to cross over, a factor that facilitates planar VLSI/WSI implementation. Further reading on this subject is available in Mead and Conway.4

The mesh topology is well suited for image processing and understanding, pattern recognition, adaptive array beam forming, matrix computations, partial differential equations, bit-level arithmetic, digital filtering, cellular automata, computer graphics, Monte Carlo simulations, and games. Nowadays, many special-purpose and/or commercial processor arrays are being used for these applications. Poor yield and reliability remain important obstacles to making these machines more integrated, less expensive, larger (in terms of numbers of processors), and more [...]


References


1. Uhr, L., Computer Arrays and Networks: Algorithm-Structured Parallel Architectures, Academic Press, New York, 1982.

2. Preston, K., Jr., and M.J.B. Duff, Modern Cellular Automata: Theory and Applications, Plenum, New York, 1984.

3. Computer (special issue on systolic arrays), Vol. 20, No. 7, July 1987.

4. Mead, C., and L. Conway, Introduction to VLSI Systems, Section 8.3, "Algorithms for VLSI Processor Arrays," Addison-Wesley, Reading, Mass., 1980, pp. 271-292.

Distinguishing between two types of array reconfiguration that differ in their goals and in when they take place is important. Static reconfiguration is performed at array fabrication time and often uses hard (fixed) interconnections to bypass defective cells to enhance yield. Dynamic reconfiguration is performed during the operational lifetime of the array and over time uses soft (programmable) interconnections to logically replace the faulty cells to improve array reliability.

The two types of reconfiguration also differ in the relative weighting of the SEAL figures of merit. For example, in dynamic reconfiguration, it is important to have simple schemes to avoid degradation of system performance. On the other hand, in static reconfiguration, using a somewhat slower technique to achieve higher efficiency can be desirable and affordable. In both types, however, a major consideration is to minimize area, which can impact yield and reliability to different degrees. Finally, manufacturing defects and operational faults can have different distributions on the array, and evaluation of a particular scheme must take this fact into consideration.

Taxonomy for reconfiguration techniques

To illustrate reconfiguration characteristics, we use three schemes from Kung and Lam:

• Scheme I. Faulty rows and/or columns are skipped (this is illustrated in Figure 4). A greedy algorithm can be used to first eliminate the row or column containing the most faulty cells.


Figure 4. A (2 × 3) reconfigured FTPA (middle row is skipped over).

• Scheme II. The strategy is to build an array row by row, picking as the next cell the one that satisfies a predetermined criterion and that excludes the least number of live cells from being used. To accommodate delay between two adjacent cells in a logical array, some "delay registers" can be used. A simple "maze-runner" can be used to determine the interconnect requirements.

• Scheme III. The FTPA is partitioned into horizontal strips, and the cells in each strip are connected to form logical rows. These logical rows are then connected to form the whole array. Also, delay registers are used in the reconfigured array.
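The greedy idea mentioned for Scheme I can be sketched as follows. This is a minimal illustration under stated assumptions, not the published algorithm: the function name `greedy_skip`, the tie-breaking rule (rows before columns, lower index first), and the limit of one skipped row or column per available spare row or column are all choices made for the example.

```python
from collections import Counter


def greedy_skip(faulty, spare_rows, spare_cols):
    """Repeatedly skip the row or column holding the most remaining faults.

    faulty: iterable of faulty physical cells (row, col), 1-based.
    Returns (skipped_rows, skipped_cols) as sorted lists, or None if faults
    remain after all spare rows and columns have been used.
    """
    remaining = set(faulty)
    skipped_rows, skipped_cols = [], []
    while remaining:
        candidates = []
        if len(skipped_rows) < spare_rows:
            counts = Counter(r for r, _ in remaining)
            row, n = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
            candidates.append((n, "row", row))
        if len(skipped_cols) < spare_cols:
            counts = Counter(c for _, c in remaining)
            col, n = max(counts.items(), key=lambda kv: (kv[1], -kv[0]))
            candidates.append((n, "col", col))
        if not candidates:
            return None  # faults remain but no spare row or column is left
        _, kind, index = max(candidates, key=lambda t: t[0])
        if kind == "row":
            skipped_rows.append(index)
            remaining = {(r, c) for r, c in remaining if r != index}
        else:
            skipped_cols.append(index)
            remaining = {(r, c) for r, c in remaining if c != index}
    return sorted(skipped_rows), sorted(skipped_cols)


# Two faults in the middle row of a 3 x 3 physical array with one spare row.
result = greedy_skip({(2, 1), (2, 3)}, spare_rows=1, spare_cols=0)
```

Because the choice at each step is local, this kind of greedy pass can fail on fault patterns that a more careful assignment of spare rows and columns would handle.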

The taxonomy proposed in this article is based on a variety of reconfiguration characteristics, namely type of redundancy, allocation of redundancy, replacement unit, switching domain, and switching mechanism.

The type of redundancy can be hardware or time. Hardware redundancy indicates that spare cells are available to logically replace faulty cells, whereas time redundancy means that operational cells are assigned the computations that faulty cells would perform if they were operational. All three schemes outlined above use hardware redundancy. An array can use both types of redundancy (for example, when graceful degradation is allowed after all physical spares have been used), but this is equivalent to having two reconfiguration schemes for the same array.

The allocation of redundancy can be local or global. Local redundancy corresponds to the case when spares can logically replace only cells in specific parts of the array (horizontal strips in Scheme III above). In a broad sense, this is equivalent to partitioning the array into smaller blocks, each reconfigured locally. In some cases, when a block cannot be reconfigured, a block-replacement scheme can be used to substitute the full block for a good one. These techniques have a hierarchical nature and correspond to a combination of reconfiguration schemes that can belong to different classes. Global redundancy, on the other hand, means that any of the spare cells can logically replace any faulty cell. In some schemes, any cell can be either a spare or a regular cell.

The replacement unit can be a cell or a set of cells, and the array is said to be processor switched or set switched, respectively. In processor-switched arrays, the replacement of a faulty cell only requires the logical removal of that cell. On the contrary, in set-switched arrays, the replacement of a faulty cell requires the logical removal and replacement of a set of cells, such as a column or a row of cells (see Scheme I above).

The switching domain can be distributed or global. The form of connection between any two cells depends on their physical location when distributed (local) switching is used; topologically close cells are more easily connected, whereas distant cells cannot be connected or require more switches/buses/linking elements for the connection. In an array with global switching, the form of connection between any two cells is independent of their position in the array. Usually, distributed switching (as in all the schemes illustrated above) allows the interspersing of switching mechanisms with cells (for example, a switch lattice), whereas global switching requires the separation of the reconfiguration hardware from the processing elements (for example, a global bus or interconnection network).

The switching mechanism can be a switch element, a switch bus, or a switch network. A switch element consists of cells or other elements (for example, multiplexers or delay registers as in Schemes II and III) that are capable of routing data. In practice, this corresponds to the existence of bypassing mechanisms "inside" the elements that connect inputs directly to outputs. A switch bus consists of a bus and switches where different wires can be connected to different cells to establish a particular interconnection pattern. Distinct segments of each wire can also be used to connect different pairs of cells, and switches are used to make or break connections between wire segments and between wires and cells. A switch network consists of a collection of interconnected switches that, depending on the setting of each switch, establish connections between the inputs and outputs of specific cells (for example, a crossbar network, a multistage interconnection network, or a lattice of switches interspersed among cells).


Figure 5. An FTPA structure with multiplexer hardware to support switching.


Figure 6. An example of the Kuo-Fuchs approach.4


Figure 7. Bipartite description of the fault distribution in Figure 6.

Figures 1 and 2 show the classification trees defined by the taxonomy just introduced. Each leaf of the trees corresponds to a different class of reconfiguration schemes. References for selected schemes discussed in this article (including the references in the "Further reading" sidebar) are shown next to the leaves of the trees. In these references, readers can find detailed descriptions of schemes in the classes specified by the respective leaves (however, readers are cautioned that some references describe several schemes, possibly from distinct classes). When the same scheme appears in more than one leaf, it indicates a hierarchical combination of two different schemes, each of which corresponds to a leaf or class. In these cases, in addition to a reference, the domain of reconfiguration is specified in parentheses. As already mentioned, the large number of types of schemes determined by the taxonomy are grouped into four major classes. This is done to facilitate the discussion and comparison of fundamentally different approaches to the reconfiguration problem. Three of these four major classes correspond to hardware redundancy solutions, namely local redundancy, processor switching with global redundancy, and set switching with global redundancy. The fourth major class contains schemes that use time redundancy. For simplicity, these major classes are called the local-redundancy class, the processor-switching class, the set-switching class, and the time-redundancy class. Asterisks indicate the roots of the subtrees (see Figures 1 and 2) that correspond to these classes.

Set-switching  class 

In the set-switching class, the replacement of a faulty cell requires the logical removal and replacement of a set of cells. Typical replacement sets are FTPA rows and columns. A typical array reconfiguration is shown in Figure 4, where one row of the array is skipped (assuming one spare row is provided). This is accomplished, for instance, by setting simple multiplexer switches that have been incorporated into the FTPA structure, as shown in Figure 5. The bypass circuitry (multiplexer) can be implemented inside a cell, as proposed in Koren.1 Bypassing a faulty row is accomplished by turning all cells (including faulty cells) in that row into "connecting elements" to allow signals to go through them. The main advantage of the schemes in this class is their simplicity. Since the switches used to bypass (or to pass through) a faulty set are simple, the cost of redundant interconnect and control circuitry is minimal. On the other hand, the waste of cells is the main drawback.
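Row and column bypassing induces a simple logical-to-physical mapping: the i-th logical row is the i-th physical row that has not been bypassed, and likewise for columns. The sketch below illustrates this; the function name `bypass_mapping` and its argument conventions are assumptions made for the example, not part of any scheme cited in the article.

```python
def bypass_mapping(M, N, R, C, skipped_rows=(), skipped_cols=()):
    """Map logical cells to physical cells when whole rows/columns are bypassed.

    Returns a dict {(i, j): (k, l)} with 1-based coordinates, where logical
    row i lands on the i-th non-bypassed physical row (same for columns).
    Raises ValueError if too few rows or columns remain for an M x N array.
    """
    live_rows = [k for k in range(1, M + R + 1) if k not in set(skipped_rows)]
    live_cols = [l for l in range(1, N + C + 1) if l not in set(skipped_cols)]
    if len(live_rows) < M or len(live_cols) < N:
        raise ValueError("not enough rows or columns left after bypassing")
    return {(i, j): (live_rows[i - 1], live_cols[j - 1])
            for i in range(1, M + 1) for j in range(1, N + 1)}


# The situation of Figure 4: a 2 x 3 logical array, one spare row,
# middle physical row bypassed.
mapping = bypass_mapping(M=2, N=3, R=1, C=0, skipped_rows=[2])
```

With this mapping, logical row 2 sits on physical row 3, so every fault-free cell in the bypassed middle row goes unused, which is the waste of cells noted above.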

There are several schemes in this class, many of which use a spare row, a spare column, or both. The main distinction among them is the way switch elements are implemented. Certain schemes also use multiple spare rows and/or columns, and a typical one is discussed next. (See "Further reading" for additional references on this topic.)

The Kuo-Fuchs strategy. Consider a (7+2) x (9+3) FTPA as shown in
Figure 6, in which circles represent faulty cells. A general
set-replacement algorithm can replace faulty rows/columns by proceeding
from left to right and from top to bottom; that is, rows 1 and 3 can be
replaced by the two spare rows, and columns 1, 3, and 4 can be replaced
by the three spare columns. For the array in Figure 6, this strategy
fails to reconfigure the array. In the Kuo-Fuchs strategy,4 rows and/or
columns that contain the largest number of faults are replaced first.
To facilitate this, the FTPA is modeled as a bipartite graph whose two
sets of nodes are the array rows and columns that contain faulty cells,
and whose edges are the faulty cells. Figure 7 shows the bipartite
description of the array in Figure 6. In the graph, $R_i$ and $C_j$
represent faulty row i and faulty column j, respectively. There are
three edges from $R_4$, namely to columns $C_3$, $C_4$, and $C_7$,
since row 4 contains three faulty cells: (4,3), (4,4), and (4,7). The
Kuo-Fuchs approach replaces two faulty rows with the two spare rows and
three faulty columns ($C_4$ among them) with the three spare columns,
and the reconfiguration is successful.
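
The "most faults first" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Kuo-Fuchs algorithm itself: the function name `greedy_spare_allocation`, the tie-breaking rule (prefer a row on ties, then the smaller index), and the fault coordinates are all hypothetical, and the fault pattern is not the one drawn in Figure 6. Each step recounts the faults per row and per column (the node degrees in the bipartite graph) and replaces the line that covers the most remaining faults.

```python
from collections import Counter


def greedy_spare_allocation(faults, spare_rows, spare_cols):
    """Cover faulty cells with spare rows/columns, most faults first.

    faults: iterable of (row, col) coordinates of faulty cells.
    Returns (rows_replaced, cols_replaced), or None when the greedy
    choice runs out of spares. A None result does not prove that no
    valid assignment exists; this is a heuristic.
    """
    remaining = set(faults)
    rows_used, cols_used = [], []
    while remaining:
        # Degrees of the row and column nodes in the bipartite graph.
        row_count = Counter(r for r, _ in remaining)
        col_count = Counter(c for _, c in remaining)
        best_row = best_col = None
        if len(rows_used) < spare_rows:
            best_row = max(row_count, key=lambda r: (row_count[r], -r))
        if len(cols_used) < spare_cols:
            best_col = max(col_count, key=lambda c: (col_count[c], -c))
        if best_row is None and best_col is None:
            return None  # faults remain but no spares are left
        take_row = best_row is not None and (
            best_col is None or row_count[best_row] >= col_count[best_col]
        )
        if take_row:
            rows_used.append(best_row)
            remaining = {(r, c) for r, c in remaining if r != best_row}
        else:
            cols_used.append(best_col)
            remaining = {(r, c) for r, c in remaining if c != best_col}
    return sorted(rows_used), sorted(cols_used)


# Hypothetical fault pattern for an array with 2 spare rows, 3 spare columns.
example_faults = [
    (1, 1), (2, 3), (3, 1),
    (4, 3), (4, 4), (4, 7),
    (5, 4), (6, 2), (6, 5), (6, 9), (7, 3),
]
print(greedy_spare_allocation(example_faults, spare_rows=2, spare_cols=3))
```

With this pattern, spending the two spare rows on the first two faulty rows would leave faults in six distinct columns for only three spare columns, whereas the greedy choice spends the spare rows on the two rows holding three faults each.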

The purpose of the bipartite description is to transform the
reconfiguration problem into a graph-theoretic question, that is, that
of finding the minimum cover of the bipartite graph mentioned above.
Kuo and Fuchs4 have shown that finding the optimal solution to the
problem is NP-complete (nondeterministic polynomial-time complete)


60 


COMPUTER 


Table 1. Summary of set-switching reconfiguration schemes.

Reconfiguration Scheme                   Redundant Set
---------------------------------------  ---------------------------------------
Massively parallel processor (MPP)*,     Group of 128 x 4 processing elements
  Binary array processor (BAP)*
Mesh interconnection chip (MIC)*         Column of interconnect nodes
Koren1                                   Row/column of PEs
McCanny-McWhirter*                       Row of PEs
Kim-Reddy*                               Row of PEs
Kuo-Fuchs4                               Row/column of PEs
Hwang-Raghavendra**                      Row of PEs
Configurable, highly parallel (CHiP)10   Column of blocks of processing elements
Jesshope-Bentley**                       Row of PEs

*  See the "Set-switching techniques" section of "Further reading."
** See the "Local-redundancy techniques" section of "Further reading."


Figure  8.  A  possible  layout  for  the  Diogenes  approach. 


and proposed two algorithms. One, for reasonably small arrays, uses a
branch-and-bound approach to reduce the complexity of searching for a
minimum cover. The other, for very large arrays, is a polynomial-time
approximation algorithm that runs efficiently and often yields optimal
solutions.

In summary, a scheme in the set-switching class is characterized by its
replacement unit, whether it be a row, a column, or a subarray. Since
the replacement unit is a set of cells, a very large number of spares
might be required to provide sufficient fault tolerance for an array.
However, the simplicity of the algorithm and the switching mechanism
compensates for this wasteful use of spares. All set-switching
techniques described here and in "Further reading" are summarized in
Table 1, which shows the corresponding replacement sets.

Processor-switching class

In processor-switching schemes, an available cell directly or
indirectly replaces a faulty cell. For this reason, the waste of
functional cells can be reduced. There are two approaches in this
class. In one approach, functional cells are systematically collected
(such as in a linear fashion, as in Scheme ID) to form a part (say, a
row) of the target array. In the other approach, faulty cells are
replaced by their neighbors, and those neighbors are in turn replaced
by their neighbors, and so on, in a chain fashion, until spare cells
are reached. Next, we discuss the Diogenes scheme3 as an example of the
first approach, and the fault-stealing6 and FUSS7 schemes as examples
of the second approach.

Diogenes strategy. In the Diogenes-strategy approach,3 functional cells
are collected by a mechanism consisting of programmable switches and
bundles of wire that run above a line of cells and spare cells (that
is, the array is linearized). Faulty cells are simply left unconnected.
This procedure allows the Diogenes algorithm to achieve 100-percent use
of fault-free PEs. Figure 8 illustrates the Diogenes concept. Small
squares above individual cells represent switches that can connect
fault-free cells (unmarked squares) or bypass faulty cells (squares
marked with X). The switching emulates a stack (or queue) mechanism
that "pushes" and "pops" wires "into" and "out of" processors. For an
M x N FTPA (rectangular grid), the scheme needs three control lines and
two bundles of wires: N lines for the stack and N (plus some other
lines) for the queue.

In the Diogenes approach, the switch-bus mechanism is the most
important part of the design. By adequately setting the switches,
implementing different connection patterns is possible. For a specific
desirable connectivity, the number of wires and switches can be chosen
so that reconfiguring the array is always possible. One advantage of
this approach is the possibility of implementing different topologies,
including two-dimensional arrays. Its stack/queue-like switch bus
allows the scheme to achieve 100-percent use of fault-free cells.
However, a large area must be allocated for the switch bus, which might
itself fail. Another potential disadvantage is that, in the presence of
consecutive faulty PEs, logically adjacent PEs can be very far apart,
thus requiring longer interconnect links. For this reason, additional
vertical-shortcut paths across the array have been considered. The
problem can be alleviated if somewhat simpler switching mechanisms are
used at the expense of lower array survivabilities. This strategy is
implemented in many schemes, two of which are

January  1990 


61 


Figure  9.  Array  reconfigured  by  direct-reconfiguration  (DR)  scheme. 


Figure 10. Array reconfigured by complex-fault-stealing (CFS) scheme.


presented  next.  The  Diogenes  approach  is 
best  for  small  arrays  with  large  processing 
elements. 
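
The linear harvesting idea behind Diogenes, and its long-link drawback, can be illustrated with a small Python sketch. It models only the logical outcome (which physical cells end up in the target array), not the switch bus or the stack and queue wiring; the names `diogenes_harvest` and `longest_link` and the fault pattern are hypothetical and chosen for illustration.

```python
def diogenes_harvest(faulty, rows, cols):
    """Collect fault-free cells along a linearized array.

    faulty: list of booleans, one per physical cell in line order
            (True means the cell is faulty and is left unconnected).
    Returns the logical rows as lists of physical positions, or None
    when fewer than rows * cols fault-free cells exist.
    """
    good = [pos for pos, bad in enumerate(faulty) if not bad]
    if len(good) < rows * cols:
        return None
    used = good[: rows * cols]
    return [used[r * cols:(r + 1) * cols] for r in range(rows)]


def longest_link(logical_rows):
    """Largest physical distance between consecutive harvested cells."""
    flat = [pos for row in logical_rows for pos in row]
    return max(b - a for a, b in zip(flat, flat[1:]))


# Ten physical cells in a line; cells 2, 3, 4, and 8 are faulty.
line = [False, False, True, True, True, False, False, False, True, False]
target = diogenes_harvest(line, rows=2, cols=3)
print(target)                # physical positions forming the 2 x 3 array
print(longest_link(target))  # the run of three faulty cells stretches one link
```

Every fault-free cell is used as long as enough of them exist, but the run of consecutive faulty cells forces one link to span four positions, which is the wire-length effect described above.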

Fault-stealing strategy. Fault-stealing techniques,6 also known as
index-mapping schemes, reconfigure an FTPA in such a way that a faulty
cell is replaced by a neighbor, and this neighbor, in turn, is replaced
by its neighbor, and so on, until a spare cell is used. The spare cell
is said to logically replace the faulty cell; the term "shift" is used
to mean direct replacement: a cell is shifted to another if the former
is directly replaced by the latter. For instance, in Figure 9, cell
(1,3) can be replaced by (1,5) using the following successive direct
replacements: each cell (1,j) is directly replaced by (1,j+1), for
j = 3, 4. This is referred to as a "chain shift."
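
A chain shift along one row can be written out explicitly. The Python sketch below is a simplified illustration that assumes a single faulty cell in the row and a spare at the right end; it ignores the other faults shown in Figure 9, and the function names `chain_shift_right` and `logical_to_physical` are hypothetical.

```python
def chain_shift_right(faulty_col, spare_col):
    """List the direct replacements (from_col, to_col) of a chain shift."""
    return [(j, j + 1) for j in range(faulty_col, spare_col)]


def logical_to_physical(num_logical_cols, faulty_col):
    """Map each logical column of the row to the physical column used.

    Columns left of the fault keep their position; the faulty column
    and everything to its right move one position toward the spare.
    """
    return {
        j: (j if j < faulty_col else j + 1)
        for j in range(1, num_logical_cols + 1)
    }


# Row 1 with four logical columns, a fault in column 3, a spare in column 5.
print(chain_shift_right(faulty_col=3, spare_col=5))
print(logical_to_physical(num_logical_cols=4, faulty_col=3))
```

The first call lists the two direct replacements for j = 3 and j = 4; the second shows that logical column 4 ends up on the physical spare in column 5.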

Direct reconfiguration (DR) is a simple scheme that is the basis for
the family of fault-stealing techniques. It uses one row and one column
of spares (R = C = 1). The algorithm begins by scanning each column
upward, marking the first faulty cell (if any) of each column as a
vertical fault and other faults as horizontal faults. For the array
shown in Figure 9, (5,1), (4,2), (1,3), and (4,4) are vertical faults,
and other faulty cells are horizontal faults. Horizontal faults are
chain-shifted right, and vertical faults are chain-shifted down. In
Figure 9, (1,2) and (2,2) are chain-shifted right, and (4,2), (1,3),
and (4,4) are chain-shifted down. Note that (1,2) and (2,2) were first
shifted right and then shifted down (that is, they were shifted twice).
The algorithm fails when there is more than one horizontal fault in one
row of the array. Using this scheme, the array in Figure 9 is
successfully reconfigured. The figure shows both physical and logical
coordinates.

In DR, a faulty cell (i,j) can be shifted right to a fault-free cell
(i,j+1) or down to (i+1,j). The set consisting of cells (i,j), (i,j+1),
and (i+1,j) is referred to as an "adjacency domain." A family of
schemes can be extended from the DR scheme by varying the size of
(number of cells in) the adjacency domain and by modifying the
replacement rules. The schemes in this family are called fault-stealing
schemes. In order of increasing complexity, they are fault stealing of
simple-fixed choice, simple-variable choice, complex-fixed choice, and
complex-variable choice.

The fixed-stealing algorithm scans all rows in the order of increasing
index numbers. In each row, the rightmost unavailable cell is shifted
right, and other faulty cells "steal" (are vertically shifted




Figure 11. Array reconfigured by full-use-of-suitable-spares scheme
with two columns of spares (FUSS-2).


Figure 12. Switch-bus structure used in FUSS-1 and CFS: (a) cell and
switches, and (b) switch settings.


to) available cells on the next row. Variable stealing differs from
fixed stealing in the choice of the unavailable cell in a row to shift
right. Instead of selecting the rightmost unavailable cell, variable
stealing selects the faulty cell that cannot steal a cell on the next
row. The rest of the algorithm is basically the same as in DR.

The two algorithms we just described are referred to as simple-stealing
algorithms. Vertically, (i,j) can be shifted only to (i+1,j). In
complex fault stealing (CFS), (i,j) can also be shifted to (i+1,j+1).
For an (N+1) x (N+1) array, the complete CFS algorithm is outlined as
follows:

(1) Assume that in row i, $1 \le i \le N$, there are faulty or stolen
    (unavailable) cells $(i,k_1), \ldots, (i,k_r)$ with
    $k_1 < k_2 < \cdots < k_r$.

(2) For each $k_j$, $0 < j < r$:

    (a) If $(i+1,k_j)$ is fault free, shift $(i,k_j)$ to it.

    (b) Else if $(i+1,k_j+1)$ is fault free, shift $(i,k_j)$ to it.

    (c) Otherwise, $(i,k_j)$ is shifted right.

(3) If no cell is shifted along the row as per rule (2), then
    $(i,k_r)$ is. Otherwise, $(i,k_r)$ is shifted downward to either
    $(i+1,k_r)$ or $(i+1,k_r+1)$.

As an example, consider the array shown in Figure 10. In relation to
the case shown in Figure 9, this array contains an additional faulty
cell at (2,3). Were DR used, the reconfiguration would fail because
there are two horizontal faults in the first row. However, if CFS is
used, then the reconfiguration is successful, as Figure 10 shows. The
figure shows logical coordinates and, to avoid cluttering, does not
show connections for logical cells on the same columns.

The algorithm fails when, along a single row, two or more cells are
shifted right, as in the case of Figure 11, where another fault occurs
at cell (2,4). The interconnect links required by the algorithm are
more complex than those required by DR, but they can be implemented by
the switch bus shown in Figure 12. On the other hand, the probability
of survival is higher than that of DR.

FUSS strategy. A reconfiguration scheme called FUSS (Full Use of
Suitable Spares)7 uses an indicator vector called a surplus vector to
guide the replacement of faulty cells in an FTPA. In its most general
and ideal case, FUSS achieves 100-percent array survivability. Simple
FUSS (or FUSS-C), a practical instance of FUSS, can achieve up to
99-percent probability of survival.

In FUSS-C, the FTPA is an M x (N+C) array in which C is the number of
spare columns (spare rows are not used). Basically, the algorithm can
be divided into four steps: array preprocessing, surplus normalization,
fault shifting, and cell interconnection. This subdivision is done for
clarity as well as for monitoring the algorithm progress.

In array preprocessing, the surplus vector is computed. Let $f_i$ be
the number of faulty cells in row i. The surplus vector is defined as

$S = [s_1, s_2, \ldots, s_M]^T$

where

$s_i = \sum_{j=1}^{i} (C - f_j) = Ci - \sum_{j=1}^{i} f_j$

is the surplus of row i.




Figure  14.  Comparison  of  array  survivabilities  (20  x  20). 


Figure 13. An example of FUSS-1: (a) initial array A; (b) array B after
shift; (c) final reconfigured array.


(1) If $s_i > 0$, then the sum of spares in rows 1 through i is greater
than the number of faulty cells in rows 1 through i; thus, row i has
extra cells available for use by faulty cells in rows i+1, i+2, ..., M.

(2) If $s_i < 0$, then row i has a deficit and needs to use available
cells from rows i+1, i+2, ..., M.

(3) If $s_M < 0$, then the number of spares is less than the number of
faults. In this case, the array is not reconfigurable and the algorithm
must exit with failure.
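
The surplus of row i is the running total of spares minus faults over rows 1 through i, so the vector can be computed with a cumulative sum. The Python sketch below is illustrative only: the function names are hypothetical, and the per-row fault counts are an assumption chosen to agree with the values quoted in the FUSS-C example ($s_2 = -1$, $s_3 = 1$), since the matrix of Figure 13a is not reproduced in the text.

```python
from itertools import accumulate


def surplus_vector(fault_counts, spare_columns):
    """Return [s_1, ..., s_M], where s_i is the sum of (C - f_j) for j <= i."""
    return list(accumulate(spare_columns - f for f in fault_counts))


def enough_spares(surplus):
    """Case (3): a negative last entry means more faults than spares."""
    return surplus[-1] >= 0


# Assumed fault counts per row for a 4 x (4+2) array (C = 2).
s = surplus_vector([2, 3, 0, 2], spare_columns=2)
print(s)                 # row 2 shows a deficit, rows 3 and 4 a surplus
print(enough_spares(s))  # the array passes the counting test
```

Passing the counting test is necessary but not sufficient: whether the deficits can actually be absorbed depends on the positions of the faults, which the later fault-shifting step examines.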

In surplus normalization, the surplus vector is recomputed so that
fault shifting can be simplified. For instance, when $s_M > 0$, $s_M$
extra cells are available for use by "imaginary" faulty cells in
nonexistent rows M+1, M+2, .... Shifting these faulty cells to
available cells in row M would be required by the rules used in the
fault-shifting phase of FUSS. A better solution is to make $s_M = 0$,
which is one of the objectives of surplus normalization.

Fault shifting is the logical replacement of unavailable cells. In
FUSS-C, an unavailable cell (i,j) can be (1) shifted down to (i+1,j) if
$s_i$ is negative, or (2) shifted up to (i-1,j) if $s_{i-1}$ is
positive. After each shift, the corresponding entry in the surplus
vector is readjusted toward zero; that is, one is added to $s_i$ in
case (1), and one is subtracted from $s_{i-1}$ in case (2). The
shifting proceeds until the surplus vector becomes null. Its effect can
be described as a cell migration from regions having more faulty cells
to regions having fewer faulty cells.

Cell interconnection is the construction of the logical array. Logical
rows and columns are formed one at a time, using a "status matrix"
obtained from the fault-shifting phase.

The following example illustrates a FUSS-C reconfiguration. Figure 13a
is a matrix A representing a 4 x (4+2) array (4 rows, 4 columns, 2
spare columns) where entry "0" represents a good cell and "1" a faulty
cell (column and row indices are added for convenience). This array is
equivalent to that used in DR or CFS (see Figure 10) except that the
spare row becomes another spare column. The surplus vector obtained
after array preprocessing is augmented to the matrix. Figure 13b shows
the matrix obtained after fault shifting. The reconfiguration is
executed as follows:

(1) Scan the array (Figure 13a) downward. When $s_i < 0$, shift a
number equal to $|s_i|$ of unavailable cells to row i+1 and, when
successful, reset $s_i$ to 0. In the example, $s_2 = -1$. Thus, one
cell must be shifted down from row 2 to row 3: cell (2,2) is shifted
down to cell (3,2), which is now assigned a status code of 3 (Figure
13b); $s_2$ is reset to 0.

(2) Scan the array upward. When $s_i > 0$, shift $|s_i|$ unavailable
cells in row i+1 to available cells in row i; $s_i$ is reset to 0 when
all $|s_i|$ cells are shifted successfully. In the example, $s_3 = 1$.
Thus, one cell must be shifted from row 4 to row 3: cell (4,4), the
only possible choice, is shifted up to cell (3,4), which assumes the
status code of 2; $s_3$ is readjusted to 0.

At this point, the surplus vector is null, which means that fault
shifting is successful. Also, a status matrix B is obtained and shown
in Figure 13b. The matrix entries (denoted by $b_{i,j}$) are status
codes that guide the cell-interconnection phase of FUSS. Entry
$b_{i,j}$ has the following meaning: $b_{i,j} = 0$ if (i,j) is fault
free; $b_{i,j} = 1$ if (i,j) is faulty; $b_{i,j} = 2$ if (i,j) replaces
(i+1,j); and $b_{i,j} = 3$ if (i,j) replaces (i-1,j). Since the status
of each cell is known, it is easy to automatically derive how cells are
interconnected. Afterwards, the corresponding logical array is
realized. Figure 13c shows the result, with logical coordinates given.
Figure 11 shows the actual reconfigured array where logical cells along
the rows are connected. For FUSS-1, the interconnect requirements are
shown in Figure 12, where solid lines are used to connect logical
columns and dashed lines are used to connect logical rows. This same
switch-bus structure




Table 2. Probability of survival of selected reconfiguration schemes
for 20 x 20 arrays (partially from Jervis, Lombardi, and Sciuto, and
from Lombardi and Sciuto in "Further reading").

                                      Algorithm        Survivability when
Reconfiguration Scheme                Complexity       p = .50  p = .75  p = .90
------------------------------------  ---------------  -------  -------  -------
Direct reconfiguration12              O(N)             0.43     0.35     0.27
Fixed fault stealing12                O(N)             0.47     0.38     0.30
Complex fault stealing12              O(N)             0.90     0.84     0.77
Modified index mapping*               O(N*)            0.67     0.62     0.54
Modified optimal reconfiguration      O(N log N)       0.90     0.84     0.78
  algorithm (MORA)
  (see Lombardi et al.*)
Spanning tree-based**                 O(N log log N)   0.93     0.89     0.78
Orthogonal mapping*                   [illegible]      0.93     0.88     0.82
Simple full use of suitable spares    O(N*)            1.00     1.00     0.99
  (FUSS)7

*  See the "Processor-switching techniques" section of "Further reading."
** See the "Local-redundancy techniques" section of "Further reading."


can  be  used  for  the  complex  fault-stealing 
scheme. 

In summary, the probability of survival of algorithms in this class
improves over the probability of survival of those in the set-exclusion
class. Thus, fewer spare cells are being wasted. However, the
algorithms become more complex, and the interconnect-link requirements
increase. Figure 14 graphs typical values of the probability of
survival. Table 2 summarizes survivabilities of all schemes discussed
here (or mentioned in "Further reading").


Local-redundancy class

The allocation of redundancy of the reconfiguration schemes in this
class is distributed and localized in parts of an array. As mentioned
earlier, this is equivalent to partitioning or systematic reduction of
an array to smaller subarrays, each of which can be reconfigured
independently. A common objective of the schemes in this class is the
minimization of the interconnection delays (that is, the length of the
longest wires between logically adjacent cells). Two typical schemes
are discussed here; one illustrates well the partitioning
characteristic, and the other is a typical local-redundancy scheme.

Divide and conquer. The divide-and-conquer technique consists of
procedures for restructuring two-dimensional as well as linear WSI
arrays. It focuses on achieving high reconfiguration harvest with short
interconnect wires between logically adjacent cells. To construct a
two-dimensional array out of functional cells, the array is recursively
bisected and the number of live cells in each half is counted. Based on
the number of live cells in each half, the algorithm computes the
dimensions of the two subarrays that must be constructed and then
recursively constructs the subarrays. The recursion stops when each
subarray is small enough to be constructed by inspection. The subarrays
are then linked to form the complete array. In the worst case, this
strategy costs polynomial time and space. The worst-case complexity of
the interconnect wires grows with the array size.

Minimization of the maximum wire length and the channel width (number
of wires between two rows or two columns) can be achieved by
constructing the array out of not all of the fault-free cells but a
fraction of them. In other words, the wire length is further reduced if
a degraded array harvest is acceptable. This means that some isolated
fault-free cells in predominantly defective regions are not used in the
reconfiguration.

Figure 15. A possible layout for the interstitial-redundancy approach
to reconfiguration.

Interstitial redundancy. The interstitial-redundancy scheme1 is a good
illustration of a local-redundancy technique. The objective of the
reconfiguration strategies is to achieve fair area efficiency by using
as many good cells as possible, while also ensuring shorter intercell
communication links. Here, spare cells are assigned to replace failed
primary cells in the prearranged clusters of a two-dimensional array.
For instance, in a 25-percent redundancy array, as shown in Figure 15,
every spare (squares labeled S) is assigned to a cluster of four
primary cells. Each cluster is independent and can tolerate a faulty
cell. Since spares are physically close to the cell they replace,
restructured interconnections are fixed and short. This results in low
intercell communication delays, thus minimizing performance
degradation. For higher redundancy arrays (50 percent, 100 percent,
etc.) spares can also be shared among cells in different clusters. For
a wide range of array sizes, estimations of yield for arrays with
different amounts of redundancy show that approximately 50-percent use
of fault-free cells on the chip can be achieved.
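
The statement that each cluster tolerates one faulty cell can be turned into a simple survival estimate. The Python sketch below is a back-of-the-envelope model, not the yield analysis cited in the text: it assumes that cells fail independently, that every cell (spare included) is fault free with the same probability `p`, that switches and wires never fail, and that spares are not shared between clusters. The function names are hypothetical.

```python
from math import comb


def cluster_survival(p, primaries=4, spares=1):
    """Probability that a cluster has at most `spares` faulty cells.

    With the default 25-percent redundancy cluster (four primaries and
    one spare), the cluster survives when zero or one of its five
    cells is faulty.
    """
    n = primaries + spares
    return sum(
        comb(n, k) * (1 - p) ** k * p ** (n - k) for k in range(spares + 1)
    )


def array_survival(p, clusters):
    """All clusters must survive, since each is reconfigured locally."""
    return cluster_survival(p) ** clusters


p = 0.9
print(round(cluster_survival(p), 5))    # one cluster with its spare
print(round(p ** 4, 5))                 # the same four cells without a spare
print(round(array_survival(p, 16), 5))  # 16 independent clusters
```

For p = 0.9 the spare raises the survival of a four-cell cluster from about 0.66 to about 0.92, but because every cluster must survive on its own, the benefit erodes as the number of clusters grows.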

Hierarchical redundancy. As mentioned earlier, in local-redundancy
schemes, reconfiguration takes place locally in each block of an array
partition. If reconfiguration is not possible within a block, the
system will fail unless the faulty block can be replaced by a
functional one. From a structural point of view, hierarchically
combining two or more reconfiguration strategies is possible and
practical in some arrays. Typically, these arrays are themselves
organized in a hierarchical way. For example, a two-level hierarchical
FTPA is an array of small subarrays. A multilevel hierarchical FTPA can
be defined recursively, that is, as an array of multilevel subarrays.
Different reconfiguration schemes can be applied to different levels of
FTPA physical structures, with the overall array reconfiguration
schemes having a hierarchical nature. In other words, if a faulty
processor cannot be replaced within a subarray (using a given
reconfiguration technique), that subarray is declared faulty and
possibly replaced by a spare one (possibly using a different
reconfiguration technique).

Several examples of hierarchical schemes have been proposed and/or
implemented; additional references are given in the local-redundancy
techniques section of "Further reading." A typical one is the CHiP
(configurable, highly parallel) architecture10 made up of building
blocks, each of which is a two-dimensional CHiP array. The
reconfiguration scheme for the high level (array level) is implemented
by set switching and that of the low level (block level) by processor
switching. The CHiP architecture is highly flexible and the switch
lattice (connection mechanism) can provide substantial fault tolerance
in VLSI/WSI implementation. However, the area complexity of the switch
is high.

The analysis and design of hierarchical FTPA structures was considered
by Wang and Fortes. Their analysis shows that hierarchical
architectures can provide much higher reliability than single-level
ones, especially in the case of very large arrays.
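
A rough feel for the two-level case can be had from a k-out-of-n calculation. The Python sketch below is not the Wang and Fortes model; it is an idealized illustration that assumes independent cell failures, perfect switches, a block that works when at least `block_cells` of its cells work, and an array that works when at least `blocks` of its blocks work. All function and parameter names are hypothetical.

```python
from math import comb


def k_out_of_n(k, n, r):
    """Probability that at least k of n independent units work."""
    return sum(comb(n, i) * r ** i * (1 - r) ** (n - i) for i in range(k, n + 1))


def two_level_reliability(cell_r, block_cells, block_spares, blocks, spare_blocks):
    """Reliability of an array of blocks, with spares at both levels."""
    block_r = k_out_of_n(block_cells, block_cells + block_spares, cell_r)
    return k_out_of_n(blocks, blocks + spare_blocks, block_r)


cell_r = 0.95
no_spares = two_level_reliability(cell_r, 4, 0, 4, 0)
cell_spares_only = two_level_reliability(cell_r, 4, 1, 4, 0)
both_levels = two_level_reliability(cell_r, 4, 1, 4, 1)
print(round(no_spares, 4), round(cell_spares_only, 4), round(both_levels, 4))
```

Under these assumptions, a block that exhausts its own spare is no longer fatal once a spare block exists, which is why the third figure exceeds the second.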


Time-redundancy approach

The time-redundancy approach to fault-tolerant systems has been used
for yield enhancement, reliable computation, and self testing. Ideally,
a time-redundancy reconfiguration scheme uses only logical spare cells,
that is, physical cells that perform their own functions as well as the
functions of the faulty ones. However, it can also include techniques
that use some (limited) physical spare cells in the reconfiguration.
Time-redundancy techniques can also be used to support graceful
degradation, because they can prevent system failure after spares are
exhausted while reducing the computational speed of the overall system.
A time-redundancy scheme can use a limited number of cells to perform
operations that would require many more cells in a hardware-redundancy
scheme.

Sami and Stefanelli12 presented a typical time-redundancy technique
that involves repeated use of functional cells and requires two phases
and three clock cycles. Each cell is associated with two logical cells.
During the first phase, the cell executes the functions of the first
logical cell; during the second phase, it executes the functions of the
second logical cell. The technique requires far fewer redundant
hardware elements than the hardware approaches discussed previously.
However, the processing speed is lowered.

SRE and ARCE. The successive row elimination (SRE) and alternate row
column elimination (ARCE) schemes11 are examples of graceful
degradation schemes. In SRE, a row is eliminated if it contains one or
more faults (set switching), and in ARCE either a row or a column that
contains one or more faulty elements is bypassed. These schemes are
similar to the Kuo-Fuchs scheme presented earlier, except that no spare
rows or spare columns are used and they are complemented by algorithm
reconfiguration techniques that allow an algorithm designed for a
processor array of a given size to run on a processor array of smaller
size. Rows and columns can be theoretically eliminated until one row
and one column are left in the array. The performance of the algorithms
is allowed to degrade gracefully; for example, with up to (N/2)-1
faults, SRE allows the array to operate at half the speed of the fully
operational array.
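
The half-speed figure can be checked with a small Python sketch. The slowdown model is an assumption made for illustration: an algorithm written for N rows is taken to run on the surviving rows in ceil(N / surviving) passes. The text does not spell out the algorithm-reconfiguration technique, and the function names are hypothetical.

```python
from math import ceil


def sre_surviving_rows(num_rows, faults):
    """Rows left after eliminating every row that contains a fault."""
    faulty_rows = {r for r, _ in faults}
    return [r for r in range(1, num_rows + 1) if r not in faulty_rows]


def sre_slowdown(num_rows, faults):
    """Passes needed to emulate num_rows rows on the surviving rows.

    Returns None when no row survives.
    """
    alive = len(sre_surviving_rows(num_rows, faults))
    if alive == 0:
        return None
    return ceil(num_rows / alive)


# N = 8 rows with (N/2) - 1 = 3 faults, each in a different row.
faults = [(2, 5), (4, 1), (7, 3)]
print(sre_surviving_rows(8, faults))
print(sre_slowdown(8, faults))  # two passes, that is, half speed
```

With at most (N/2)-1 faults, more than half of the rows always survive, so two passes suffice under this model, which agrees with the half-speed statement above.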


In summary, time-redundancy approaches to array reconfiguration might
require less hardware but demand longer processing time than
hardware-redundancy techniques. These techniques can be particularly
useful in some architectures, such as systolic arrays, where idle
cycles can be used by logical spares to perform functions of the faulty
cells. In addition, the time-redundancy approach can be considered when
decreasing speed is an acceptable form of degradation.

Concluding remarks

The proposed taxonomy for classifying different reconfiguration schemes
for processor arrays can be used to compare and contrast many possible
approaches in light of basic orthogonal characteristics. In addition,
we hope the different schemes presented to illustrate the taxonomy will
help interested engineers and researchers understand existing
approaches and develop new ones.

An important yet difficult question to be faced by designers is that of
deciding which of the many possible schemes is the best for their
application. The ultimate benefit of reconfiguration schemes is yield
improvement (in the case of static reconfiguration) or reliability
increase (for dynamic reconfiguration).

A designer must be able to evaluate the impact of a scheme on yield
and/or reliability, and these are complex functions of technology,
application, scheme survivability, and scheme overhead. The latter
factor includes the hardware, software, and time necessary to implement
the scheme. The overhead can itself cause faults and performance
degradation that offset the survivability of the scheme. Reliability
and yield models must account not only for processing elements but also
for connections, switches, reconfiguration logic, and other hardware
and software mechanisms.

Much research is currently under way to identify valid defect and
fault-distribution models and to develop models and tools for
reliability and yield evaluation. Studies by the authors and other
researchers clearly show that there is no universally ideal
reconfiguration scheme. For this reason, identifying analytical
approaches to the evaluation of survivability, reliability, and yield
has become an important research direction. Good results in this area
will support a deeper technical understanding of reconfiguration
schemes and systematic derivation, analysis, and evaluation of new



COMPUTER 


Table 3. Comparison of reconfiguration classes in terms of SEAL
(simplicity, efficiency, area, locality) figure of merit.

Scheme                 Simplicity   Efficiency   Area   Locality
Set switching          Good         Poor         Poor   Good
Processor switching    Fair         Good         Fair   Good
Local redundancy       Good         Fair         Fair   Good
Time redundancy        Poor         Good         Good   Good

approaches. This will help explain empirical approaches and replace
ad hoc solutions by fundamental techniques.

As mentioned above, it is difficult to make universal statements about
the relative advantages of the different schemes, namely the
set-switching class, processor-switching class, local-redundancy class,
and time-redundancy class. However, in Table 3, we attempt to compare,
in general terms, the four major classes in relation to the SEAL
(simplicity, efficiency, area, locality) figures of merit. Some general
issues facing the schemes are also briefly reviewed below.

From  a  redundancy  point  of  view,  the 
local-redundancy  class  of  reconfiguration 
schemes  aims  at  achieving  short  wires, 
thus  short  communication  delays,  for  in¬ 
terconnecting  logically  adjacent  cells  in 
the  reconstructed  arrays.  This  is  accom¬ 
plished  by  either  physically  assigning 
spare  cells  to  a  group  of  cells  or  by  system¬ 
atically  partitioning  the  array.  Minimizing 
the wire lengths necessitates some waste of
spare  cells,  which  can  result  in  suboptimal 
reconfiguration  harvest  or  underutilization 
of  spares. 
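The trade-off described above can be illustrated with a small sketch. It assumes a hypothetical layout that is not taken from any specific scheme in the article: a square array partitioned into fixed blocks, each with its own spare cells, where a spare may only replace a faulty cell in its own block. The function names and the 4 x 4 example are illustrative only.

```python
from collections import Counter

def block_of(cell, block_size):
    """Return the (block_row, block_col) that a physical cell belongs to."""
    row, col = cell
    return (row // block_size, col // block_size)

def survives_local(faults, block_size, spares_per_block=1):
    """True if no block has more faulty cells than it has local spares."""
    per_block = Counter(block_of(cell, block_size) for cell in faults)
    return all(count <= spares_per_block for count in per_block.values())

def survives_global(faults, total_spares):
    """Idealized bound: any spare may replace any faulty cell."""
    return len(faults) <= total_spares

# 4 x 4 array split into four 2 x 2 blocks, one spare per block (4 spares).
spread = [(0, 0), (2, 2)]      # two faults in different blocks
clustered = [(0, 0), (1, 1)]   # two faults in the same block

for faults in (spread, clustered):
    print(faults, survives_local(faults, 2), survives_global(faults, 4))
```

In this toy model the clustered pattern defeats the local scheme even though three spares remain unused, which is the underutilization of spares that the short-wire constraint brings with it.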

The waste of redundant resources can be substantial in the
set-switching class of reconfiguration schemes because entire sets of
cells are replaced. In set switching, interconnection and control
circuitry are much simpler than those in other classes. For this
reason, schemes could be easily implemented. On the other hand, the
waste of spare cells makes the techniques in this class impractical in
some applications.

The processor-switching class of reconfiguration techniques aims at
minimizing the longest wire length connecting two logically adjacent
cells and, at the same time, minimizing the waste of spares. This
requires more complex algorithms and more involved interconnect and
control reconfiguration circuitry. Existing schemes demonstrate better
spare use (than those in the two previously discussed classes) with
polynomial reconfiguration time complexity. However, designers must
carefully explore the design of interconnection and control hardware
before efficient implementation can be realized.

Finally,  when  performance  degradation 
is  acceptable,  physical  redundancy  can  be 
replaced  by  time  redundancy  to  improve 
the silicon area in FTPAs. Reconfiguration
schemes  that  employ  this  strategy  are 
grouped  in  the  time-redundancy  class. 
Here,  cells  execute  not  only  their  tasks,  but 
also  the  tasks  of  their  faulty  neighbors.  For 
this  reason,  these  schemes  can  result  in 
degraded  computational  speed. 
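As a rough illustration of why time redundancy trades speed for area, the sketch below models a linear array in which each faulty cell's task is handed to the nearest functional cell. The assignment rule (nearest functional cell, ties to the lower index) and the slowdown measure (largest number of tasks on any one cell) are assumptions made for this example, not the policy of any scheme cited in the article.

```python
def assign_tasks(n_cells, faulty):
    """Map each functional cell to the logical tasks it runs.

    Every functional cell runs its own task; each faulty cell's task is
    added to the nearest functional cell (ties go to the lower index).
    """
    load = {c: [c] for c in range(n_cells) if c not in faulty}
    for f in sorted(faulty):
        host = min(load, key=lambda c: (abs(c - f), c))
        load[host].append(f)
    return load

def slowdown(load):
    """Time steps needed per logical step: the heaviest cell sets the pace."""
    return max(len(tasks) for tasks in load.values())

fault_free = assign_tasks(6, set())
one_fault = assign_tasks(6, {2})
print(slowdown(fault_free), slowdown(one_fault))
print(one_fault[1])  # cell 1 now runs its own task and the task of cell 2
```

No spare cells are needed, but a single fault already doubles the time per logical step in this model, which matches the degraded computational speed noted above.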


We have overviewed, characterized, and classified some typical
reconfiguration schemes in light of a proposed taxonomy. This taxonomy
can be used as a guide for future research in design and analysis of
reconfiguration schemes.

Studying  how  to  evaluate  fault-tolerant 
arrays  and  how  to  exploit  application  char¬ 
acteristics  to  achieve  dependable  comput¬ 
ing  are  important  complementary  direc¬ 
tions  of  research  towards  reliable  proces¬ 
sor-array  design. 

A related research problem is that of functional reconfiguration, that
is, learning how to configure the topology of a parallel system to
implement a different function or run a different application. This
topic transcends the scope of this article, but it is worthwhile
stating that results in this area are relevant and useful to the
problem of reconfiguration for fault tolerance and vice versa.

In fact, important directions of research include how to apply or
extend processor-array reconfiguration algorithms to other topologies
and how to marry functional and fault-tolerance reconfiguration
requirements and solutions. The Diogenes approach discussed in this
article is an interesting case where this goal is naturally achieved.


Acknowledgements 

We thank the reviewers and Computer Editor-in-chief Bruce Shriver for
their comments and suggestions on how to improve the organization and
content of this article. In addition, discussions with Noe
Lopez-Benitez and Yi-Xiang Wang helped enhance the article. We also
thank Clara C Alves for her help in preparing the manuscript.

This work was partially supported by the Innovative Science and
Technology Office of the Strategic Defense Initiative Organization and
was administered through the Office of Naval Research under contracts
N00014-85-k-0588 and N00014-88-k-0723.


References 

1. J.A. Abraham et al., "Fault Tolerance Techniques for Systolic
Arrays," Computer, Vol. 20, No. 7, July 1987, pp. 63-73.

2. H.T. Kung and M. Lam, "Wafer-Scale Integration and Two-Level
Pipelined Implementations of Systolic Arrays," J. Parallel and
Distributed Computing, Vol. 1, Issue 1, 1984, pp. 32-63.

3. I. Koren, "A Reconfigurable and Fault-Tolerant VLSI Multiprocessor
Array," Proc. 8th Symp. Computer Architecture, CS Press, Los Alamitos,
Calif., Order No. 346 (microfiche), 1981, pp. 423-442.

4. S.-Y. Kuo and W.K. Fuchs, "Efficient Spare Allocation for
Reconfigurable Arrays," IEEE Design & Test, Feb. 1987, pp. 24-31.

5. A.L. Rosenberg, "The Diogenes Approach to Testable Fault-Tolerant
Arrays of Processors," IEEE Trans. Computers, Vol. C-32, Oct. 1983,
pp. 902-910.

6. R. Negrini, M. Sami, and R. Stefanelli, "Fault-Tolerance Techniques
for Array Structures Used in Supercomputing," Computer, Vol. 19, No. 2,
Feb. 1986, pp. 78-87.

7. M. Chean and J.A.B. Fortes, "FUSS: A Reconfiguration Scheme for
Fault-Tolerant Processor Arrays," Int'l Workshop Hardware Fault
Tolerance in Multiprocessors, June 1989, pp. 30-32.

8. T. Leighton and C.E. Leiserson, "Wafer-Scale Integration of Systolic
Arrays," IEEE Trans. Computers, Vol. C-34, May 1985, pp. 448-461.

9. A.D. Singh, "Interstitial Redundancy: An Area Efficient
Fault-Tolerance Scheme for Large Area VLSI Processor Arrays," IEEE
Trans. Computers, Vol. C-37, No. 11, Nov. 1988, pp. 1,398-1,410.

10. K.S. Hedlund and L. Snyder, "Wafer-Scale Integration of
Configurable, Highly Parallel (CHiP) Architectures and Processors"



January  1990 


(extended abstract), Proc. 1982 Int'l Conf. Parallel Processing, CS
Press, Los Alamitos, Calif., Order No. 421 (microfiche), 1982,
pp. 262-264.

11. Y.-X. Wang and J.A.B. Fortes, "On the Analysis and Design of
Hierarchical Fault-Tolerant Processor Arrays," Int'l Workshop Defect
and Fault Tolerance in VLSI Systems, Oct. 1988, pp. 347-355.

12. M.G. Sami and R. Stefanelli, "Fault Tolerance of VLSI Processing
Arrays: The Time-Redundancy Approach," Real-Time Systems Symp.,
Dec. 1984, pp. 200-207.

13. J.A.B. Fortes and C.S. Raghavendra, "Gracefully Degradable
Processor Arrays," IEEE Trans. Computers, Vol. C-34, Nov. 1985,
pp. 1,033-1,044.


Further  reading 

Set-switching  techniques 

A set-switching reconfiguration scheme replaces a faulty cell by
logically removing a set of cells (row, column, block, etc.) that
contains the faulty cell. For this




reason, the scheme can be easily implemented; however, the waste of
functional cells is large since many of them cannot be used. For
instance, if an array has R spare rows and every time a fault occurs
the row containing the fault is replaced by one spare row, as few as
R + 1 faulty cells on different rows can make the array useless. For
additional reading, see discussion in the main text and schemes
proposed in the following references:


Batcher, K.E., "Design of a Massively Parallel Processor," IEEE Trans.
Computers, Vol. C-29, Sept. 1980, pp. 838-840.

McCanny, J.V., and J.G. McWhirter, "Yield Enhancement of Bit-Level
Systolic Array Chips Using Fault-Tolerant Techniques," Electronics
Letters, Vol. 19, July 1983, pp. 525-527.

Ishikawa, T., et al., "Hierarchical Array Processor (HAP) Featuring
High Reliability and High System Performance," 1986 Int'l Conf.
Parallel Processing, CS Press, Los Alamitos, Calif., Order No. 724
(microfiche), Aug. 1986, pp. 293-300.

Kim, J.H., and S.M. Reddy, "On Easily Testable and Reconfigurable
Two-Dimensional Systolic Arrays," Int'l Conf. Parallel Processing, CS
Press, Los Alamitos, Calif., Order No. 783, Aug. 1987, pp. 101-109.

Hedlund, K.S., "WASP - A Wafer-Scale Systolic Processor," IEEE Int'l
Conf. Computer Design, CS Press, Los Alamitos, Calif., Order No. 642
(microfiche), Oct. 1985, pp. 565-671.
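The row-replacement example in the set-switching description can be checked with a few lines of code. This is a minimal sketch under the assumption stated there: every faulty row consumes one whole spare row, regardless of how many faults it contains. The function names are hypothetical.

```python
def faulty_rows(fault_positions):
    """Return the set of physical rows that contain at least one fault."""
    return {row for row, _col in fault_positions}

def survives_row_switching(fault_positions, spare_rows):
    """Row replacement: each faulty row uses up one entire spare row."""
    return len(faulty_rows(fault_positions)) <= spare_rows

spare_rows = 2
same_row = [(0, 0), (0, 1), (0, 2), (0, 3)]  # four faults, all in row 0
spread = [(0, 0), (1, 1), (2, 2)]            # three faults, three rows

print(survives_row_switching(same_row, spare_rows))  # True
print(survives_row_switching(spread, spare_rows))    # False
```

Four faults confined to one row are tolerated, while three faults on three different rows (one more than the number of spare rows) are not, even though almost every cell in the discarded rows is still functional.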


Processor-switching  techniques 


The replacement strategy used in these schemes proceeds in a chain
fashion such that a faulty cell is replaced by (shifted to) an
immediate neighbor, this neighbor is replaced by its neighbor, and so
on until the spare cell is reached. The sequence of replacements is
often called a replacement chain or a compensation path (see Kung,
Jean, and Chang in the "Local-redundancy techniques" section, below).
In a more complex scheme, a faulty cell can also be shifted to a
distant neighbor. Many schemes have been proposed. Some are discussed
in the main text, and others are described in Kuo and Fuchs
(Reference 4) and the following references:

Ciminiera, L., C. Demartini, and A. Valenzano, "Defect-Tolerant Array
Structures for VLSI and WSI Architectures," Proc. 20th Hawaii Int'l
Conf. System Science, 1987, pp. 21-30.

Jervis, L., F. Lombardi, and D. Sciuto, "Orthogonal Mapping: A
Reconfiguration Strategy for Fault-Tolerant VLSI/WSI 2-D Arrays," Int'l
Workshop Defect and Fault Tolerance in VLSI Systems, Oct. 1988,
pp. 309-318.

Lombardi, F., R. Negrini, and R. Stefanelli, "Reconfiguration of VLSI
Arrays: A Covering Approach," 17th Int'l Symp. Fault-Tolerant Computing
Systems, CS Press, Los Alamitos, Calif., Order No. 778 (microfiche),
1987, pp. 251-256.

Distante, F., M.G. Sami, and R. Stefanelli, "A General Index Mapping
Technique for Array Reconfiguration," Int'l Symp. Circuits and Systems,
June 1988, pp. 559-563.

Yoon, H.Y., and A.D. Singh, "A Highly Efficient Design for
Reconfiguring the Processor Array in VLSI," Int'l Conf. Parallel
Processing, CS Press, Los Alamitos, Calif., Order No. 889 (microfiche),
Aug. 1988, pp. 375-382.
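The chain replacement described at the start of this section can be sketched for the simplest case, a single row with one spare cell at its right end. The layout and the helper names are assumptions made for illustration; real schemes operate on two-dimensional arrays with more elaborate switching.

```python
def replacement_chain(faulty, spare):
    """Physical cells visited from the faulty cell to the spare, in order."""
    step = 1 if spare > faulty else -1
    return list(range(faulty, spare + step, step))

def remap_row(n_logical, faulty):
    """Logical-to-physical map for n_logical cells plus one spare at the end.

    Cells left of the fault keep their position; the rest shift by one.
    """
    return [i if i < faulty else i + 1 for i in range(n_logical)]

def longest_link(mapping):
    """Largest physical distance between logically adjacent cells."""
    return max(b - a for a, b in zip(mapping, mapping[1:]))

n_logical, faulty = 5, 2
spare = n_logical  # physical index of the spare cell
print(replacement_chain(faulty, spare))  # [2, 3, 4, 5]
mapping = remap_row(n_logical, faulty)
print(mapping, longest_link(mapping))    # [0, 1, 3, 4, 5] 2
```

Only one link grows (it now skips the faulty cell), so wire length stays bounded while the single spare is fully used. The price is switching hardware at every cell along the chain.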


Local-redundancy  techniques 

In these schemes, a given spare processor can be used to replace only
one of a predetermined subset of cells of the array, typically in
physical proximity to the spare. Many of these schemes are implemented
in a hierarchical array structure (see Ishikawa in the "Set-switching
techniques" section; and see Jesshope and Bentley, Reeves, and Hwang
and Raghavendra below). The "partitioning scheme" (see Kung, Jean, and
Chang below) and the "spanning tree" scheme are similar to Leighton and
Leiserson's divide-and-conquer strategy (Reference 8). The spanning
tree approach chooses to partition an FTPA along the array diagonal,
and the subarray construction is implemented following a spanning tree
algorithm.

Other schemes are somewhat similar to the CHiP approach discussed in
the text. References for these schemes are as follows:

Jesshope, C., and L. Bentley, "Techniques for Implementing
Two-Dimensional Wafer-Scale Processor Arrays," IEE Proc., Vol. 134,
Pt. E, Mar. 1987, pp. 87-92.

Reeves, A.P., "Fault Tolerance in Highly Parallel Mesh-Connected
Processors," in Computing Structures for Image Processing, M.J.B. Duff,
ed., Academic Press, London, 1983, pp. 77-90.

Hwang, J.H., and C.S. Raghavendra, "VLSI Implementation of
Fault-Tolerant Systolic Arrays," IEEE Int'l Conf. Computer Design, CS
Press, Los Alamitos, Calif., Order No. 735, 1986, pp. 110-113.

Kung, S.Y., S.N. Jean, and C.W. Chang, "Fault-Tolerant Array Processors
Using Single-Track Switches," IEEE Trans. Computers, Vol. 38, No. 4,
Apr. 1989, pp. 501-514.

Lombardi, F., and D. Sciuto, "Reconfiguration in WSI Arrays Using
Minimum Spanning Trees," CompEuro 87, CS Press, Los Alamitos, Calif.,
Order No. 773, May 1987, pp. 547-550.


Time-redundancy approaches

In these schemes, the redundant hardware incorporated into the FTPA is
limited to circuitry that enables functional cells to execute repeated
operations. Time-redundant schemes other than those discussed in the
text can be found in the first three references listed below.
Time-redundancy approaches have also been used extensively for error
detection, as discussed in the final two references listed below.

Negrini, R., and R. Stefanelli, "Comparative Evaluation of Space- and
Time-Redundancy Approaches for WSI Processing Arrays," in Wafer-Scale
Integration, G. Saucier and J. Trilhe, eds., Elsevier Science
Publishers B.V. (North Holland), 1986, pp. 207-222.

[...] Fortes, "Algorithm Reconfiguration Techniques for Gracefully
Degradable Processor Arrays," in Systolic Arrays, W. Moore, A. McCabe,
and R. Urquhart, eds., Adam Hilger, 1986, pp. 259-268.

Negrini, R., and R. Stefanelli, "Time Redundancy in WSI Arrays of
Processing Elements," Proc. Int'l Conf. Supercomputing Systems, CS
Press, Los Alamitos, Calif., Order No. 65* (microfiche), 1985,
pp. 429-436.

Siewiorek, D.P., and R.S. Swarz, The Theory and Practice of Reliable
System Design, Digital Press, Bedford, Mass., 1982.

Johnson, B.W., Design and Analysis of Fault-Tolerant Digital Systems,
Addison-Wesley, Reading, Mass., 1989.


Mengly Chean is an associate research computer scientist at the
Bellaire Research Center, Shell Development Company, Houston, Texas.
He served as a graduate instructor and research assistant at Purdue
University, West Lafayette, Indiana, from 1984 to 1989. His research
interests include computer architecture, fault-tolerant computing,
parallel processing, and computer networks.

Chean received his BSEE in 197 V from the University of Maryland. He
received his MSEE in 1980 and his PhD in computer engineering in 1989,
both from Purdue. He is a member of the IEEE, the IEEE Computer
Society, Eta Kappa Nu, and Tau Beta Pi.


Fortes can be contacted at the School of
Electrical  Engineering.  Purdue  University, 
West  Lafayette,  IN  47907. 


Jose A.B. Fortes has been on the faculty of Purdue University since
1984 and is currently an associate professor at the Purdue School of
Electrical Engineering. He is on a one-year leave through July 1990 at
the National Science Foundation, where he is serving as a program
director for microelectronic systems architecture. His technical
interests are in the areas of parallel processing and fault-tolerant
computing, and he has had more than 50 papers published.

Fortes received MS and PhD degrees in electrical engineering from
Colorado State University and the University of Southern California,
respectively. He is on the editorial boards of the Journal of Parallel
and Distributed Computing and the Journal of VLSI Signal Processing, is
co-chair of the Program Committee for the 1990 International Conference
on Application Specific Array Processors (to be held in September), and
is a member of the IEEE and the IEEE Computer Society.




REFERENCE  NO.  3 


Shang, W., and Fortes, J. A. B., "Time Optimal Linear Schedules for Algorithms with
Uniform Dependencies," IEEE Transactions on Computers, Volume 40, Number 6,
June 1991, pp. 723-743.

Note  -  This  paper  extends  the  results  of  reference  [1]  to  the  case  when  several  outputs 
must  be  computed  in  the  minimal  amount  of  time. 


IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 6, JUNE 1991


Time  Optimal  Linear  Schedules  for 
Algorithms  with  Uniform  Dependencies 

Weijia  Shang,  Student  Member,  IEEE,  and  Jose  A.  B.  Fortes,  Member,  IEEE 


Abstract - An algorithm can be thought of as a set of indexed
computations and if one computation uses data generated by another
computation then this data dependence can be represented by the
difference of their indexes (called dependence vector). Many important
algorithms are characterized by the fact that data dependencies are
uniform, i.e., the values of the dependence vectors are independent of
the indexes of computations. Linear schedules are a special class of
schedules described by a linear mapping of computation indexes into
time. This paper addresses the problem of identifying optimal linear
schedules for uniform dependence algorithms so that their execution
time is minimized. Procedures are proposed to solve this problem based
on the mathematical solution of a nonlinear optimization problem. The
complexity of these procedures is independent of the size of the
algorithm. Actually, the complexity is exponential in the dimension of
the index set of the algorithm and, for all practical purposes, very
small due to the limited dimension of the index set of algorithms of
practical interest. The results reported in this paper can be used to
derive time-optimal systolic designs and applied in optimizing
compilers to restructure programs at compile-time in order to maximally
exploit available parallelism.

Index  Terms —  Algorithm  mapping,  data  dependency,  linear 
schedule,  optimizing  compiler,  nested-loop  program,  systolic 
array,  time-optimal. 


I.  Introduction 

A  DIFFICULT  problem  in  the  parallel  execution  of  an 
algorithm  is  the  choice  of  a  schedule  which  results  in 
minimum  execution  time.  This  paper  formulates  and  solves 
this  problem  for  a  particular  class  of  algorithms  and  a  specific 
type  of  schedules  denoted  linear  schedules.  The  algorithms 
under  consideration  are  characterized  by  uniform  data  de¬ 
pendencies  and  unit-time  computations;  they  include  those 
described  by  single  uniform  recurrences  [1]  and  resemble  a 
very  large  number  of  systolic  computations  and  algorithms  de¬ 
scribed  by  programs  with  nested  loops.  The  results  reported  in 
this  paper  can  be  used  to  derive  time-optimal  systolic  designs 
for  those  algorithms  and  can  also  be  applied  in  optimizing 
compilers  [13],  [19]  to  restructure  programs  at  compile-time 
in  order  to  maximally  exploit  available  parallelism. 


Manuscript received March 15, 1989; revised January 10, 1990. This work
was supported in part by the National Science Foundation under Grant
DCI-8419745 and in part by the Innovative Science and Technology Office
of the Strategic Defense Initiative Organization and was administered
through the Office of Naval Research under Contracts N00014-85-k-0588
and N00014-88-k-0723.

W. Shang is with the Center for Advanced Computer Studies, University
of Southwestern Louisiana, Lafayette, LA 70504.

J. A. B. Fortes is with the School of Electrical Engineering, Purdue
University, West Lafayette, IN 47907.

IEEE  Log  Number  91gp073. 


In  simple  terms,  an  algorithm  is  represented  in  this  paper 
as  a  partially  ordered  subset  of  a  multidimensional  integer 
lattice  (called  index  set).  The  points  of  this  lattice  correspond 
to  (or  index)  computations  and  the  partial  order  reflects  the 
data  dependencies  between  them.  These  data  dependencies 
are  represented  as  vectors  that  connect  points  of  the  lattice. 
Informally,  if  a  given  dependence  vector  is  always  present 
when  the  vector  difference  between  any  two  lattice  points 
equals  the  dependence  vector,  then  the  dependence  is  said  to  be 
uniform.  If  all  dependences  are  uniform  then  the  algorithm  is 
said  to  be  a  uniform  dependence  algorithm.  A  linear  schedule 
is  a  mapping  from  the  multidimensional  algorithm  index  set 
into  the  one-dimensional  time  space;  this  mapping  is  expressed 
as  a  linear  transformation  that  involves  the  multiplication  of 
a  vector,  called  linear  schedule  vector,  by  each  and  every 
point  of  the  index  set.  The  image  of  the  index  point  under  the 
mapping  is  the  time  of  execution  of  the  computation  indexed 
by  that  point.  The  purpose  of  this  paper  is  to  show  how  the 
linear  schedule  vector  can  be  determined  so  that  the  algorithm 
can  be  executed  in  a  minimal  amount  of  time  (achievable  by  a 
linear schedule). This algorithm model and the notion of linear
schedule are easily related to similar models and concepts used
in [1]-[13], [19], and several other works.
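The informal description above can be made concrete with a short sketch. It assumes a small two-dimensional index set and two dependence vectors chosen for illustration, and it uses the customary requirement that a computation run strictly after the computations it depends on, expressed here as a positive product of the schedule vector with every dependence vector. That validity test and the step count used below (largest time minus smallest time plus one) are assumptions of this sketch, not definitions quoted from the paper.

```python
from itertools import product

def schedule_times(pi, index_set):
    """Execution time of each index point: the product of pi and the point."""
    return {j: sum(p * x for p, x in zip(pi, j)) for j in index_set}

def respects_dependencies(pi, deps):
    """Assumed validity test: pi . d > 0 for every dependence vector d."""
    return all(sum(p * x for p, x in zip(pi, d)) > 0 for d in deps)

def total_time(pi, index_set):
    """Number of unit time steps spanned by the schedule (assumed measure)."""
    times = schedule_times(pi, index_set).values()
    return max(times) - min(times) + 1

index_set = list(product(range(4), repeat=2))  # 4 x 4 index set
deps = [(1, 0), (0, 1)]                        # illustrative dependencies

for pi in [(1, 1), (1, 2), (1, -1)]:
    print(pi, respects_dependencies(pi, deps), total_time(pi, index_set))
```

Among the three candidate vectors, (1, 1) and (1, 2) respect the dependencies but need 7 and 10 steps respectively, while (1, -1) would schedule a computation before one of its operands exists. Choosing the vector that minimizes the step count is the optimization problem the paper addresses.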

A  free  schedule  schedules  computations  to  execute  as  soon 
as  their  operands  are  available.  The  total  execution  time  that 
results  from  using  a  free  schedule  is  the  exact  lower  bound  for 
the  execution  time  of  the  algorithm.  The  difference  between 
the execution time achieved by the free schedule and the
execution  time  achievable  by  an  optimal  linear  schedule  is 
bounded  by  a  constant  [1].  In  [16],  a  class  of  algorithms  is 
identified  for  which  that  difference  is  zero,  i.e.,  linear  schedules 
can  be  as  fast  as  the  free  schedule  and  achieve  the  execution 
time  lower  bound  for  this  class  of  algorithms. 
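A free schedule can be computed directly from its description: each computation starts one step after the latest of its operands. The sketch below does this by memoized recursion over a small illustrative index set and compares the result with one linear schedule. The example data, the function names, and the assumption that dependencies are acyclic are all choices made for this illustration.

```python
from functools import lru_cache
from itertools import product

def free_schedule(index_set, deps):
    """Earliest start time of every index point (dependencies assumed acyclic)."""
    points = set(index_set)

    @lru_cache(maxsize=None)
    def start_time(j):
        preds = [tuple(a - b for a, b in zip(j, d)) for d in deps]
        # Operands outside the index set are treated as available inputs.
        ready = [start_time(p) + 1 for p in preds if p in points]
        return max(ready, default=0)

    return {j: start_time(j) for j in points}

def linear_schedule(pi, index_set):
    """Times under schedule vector pi, shifted so the first step is 0."""
    raw = {j: sum(p * x for p, x in zip(pi, j)) for j in index_set}
    offset = min(raw.values())
    return {j: t - offset for j, t in raw.items()}

def length(times):
    """Total number of unit time steps used by a schedule."""
    return max(times.values()) + 1

index_set = list(product(range(4), repeat=2))
deps = [(1, 0), (0, 1)]
free = free_schedule(index_set, deps)
linear = linear_schedule((1, 1), index_set)
print(length(free), length(linear))
```

For this particular example the linear schedule with vector (1, 1) coincides with the free schedule, so the difference between the two execution times is zero. In general the free schedule only provides the lower bound against which a linear schedule is measured.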

In  practice,  linear  schedules  are  often  characterized  by 
the  fact  that  computations  in  one  or  more  “computational 
wavefronts”  take  place  at  any  given  instant  of  time.  These 
types  of  schedules  have  been  proposed  for  the  execution  of 
many  practical  algorithms  including  the  solution  of  differential 
equations  [20],  the  computation  of  uniform  recurrences  [1], 
matrix  arithmetic  algorithms  [8],  [21],  [22],  digital  signal 
processing  applications  [8],  [22],  [23],  algorithms  described  by 
nested-loop  programs  [19],  [21],  and  systolic  array  algorithms 
[8],  [21].  Linear  schedules  can  be  described  conveniently  by 
using  special  purpose  or  extended  programming  languages 
[8], [19], [21], formal mathematical descriptions [1]-[12], or
English-like explanations [23]. In some of these references,
optimization  procedures  were  proposed  for  the  determination 


0018-9340/91/0600-0723$01.00 © 1991 IEEE




of  optimal  linear  schedules;  however,  they  either  have  a 
heuristic nature or have a more restricted applicability than the
method  described  in  this  paper.  In  Section  IV,  one  of  these 
procedures,  referred  to  as  linear  programming  approach,  is 
presented  and  compared  to  that  proposed  in  Section  III.  The 
linear  programming  approach  can  be  directly  derived  from 
Lemmas  1-4  in  this  paper  and  is  similar  or  can  be  related 
to  the  approaches  proposed  in  [19],  [27],  [28],  and  other 
references. 

This paper is organized as follows. The basic terminology and
definitions used throughout the paper are introduced in Section II as
well as the formulation of the problem of determining an optimal linear
schedule. Section III is dedicated to the description of the
optimization procedure that finds the time-optimal linear schedule for
any given algorithm with uniform dependencies, the proof of its
correctness and the analysis of its complexity. Section IV presents the
linear programming approach and compares it to the approach presented
in Section III. A particular class of algorithms for which the proposed
solution is greatly simplified is considered in Section V and the
corresponding simpler optimization procedure is also provided.
Section VI is dedicated to conclusions and future work.


II.  Terminology  and  Definitions 

Throughout this paper, sets, matrices, and row vectors are denoted by
capital letters, column vectors are represented by lower case symbols
with an overbar, and scalars correspond to lower case letters. The
transposes of a vector $\bar{v}$ and a matrix $M$ are denoted
$\bar{v}^T$ and $M^T$, respectively. The symbol $E_i$ denotes the row
vector whose entries are all zeros except that the $i$th entry is equal
to unity. The vector $\bar{1}$ (or $\bar{0}$) denotes the row vector or
column vector whose entries are all ones (or zeroes). The dimensions of
vectors $\bar{1}$ and $\bar{0}$ and whether they denote row or column
vectors are implied by the context in which they are used. The vector
space spanned by a set of vectors
$S = \{\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_k\}$ is denoted
$\mathrm{sp}\{\bar{v}_1, \bar{v}_2, \ldots, \bar{v}_k\} = \mathrm{sp}\{S\}$
and its dimension (i.e., the number of linearly independent vectors in
$S$) is denoted $\dim\{S\}$. The symbol $I$ denotes the identity
matrix. The rank of a matrix $A$ is denoted $\mathrm{rank}(A)$. The set
of rational numbers, the real space, and the set of integers are
denoted $Q$, $R$, and $Z$, respectively. The set of nonnegative
integers and the set of positive integers are denoted $N$ and $N^+$,
respectively. The empty set is denoted $\emptyset$. The notation $|S|$
represents the cardinality of set $S$. Let $\bar{v}$ and $\bar{u}$ be
two vectors. Then $\bar{v} \geq \bar{u}$ means every component of
$\bar{v}$ is greater than or equal to the corresponding component of
$\bar{u}$. As a final remark, if $x$ is an element of a set $S$, the
notation $x \in S$ is used and this notation is also used to indicate
that a column vector $\bar{m}_i$ (or row vector $M_i$) is a column
(row) of a matrix $M$, i.e., $\bar{m}_i \in M$ ($M_i \in M$) means
$\bar{m}_i$ ($M_i$) is a column (row) vector of matrix $M$.

The algorithms of interest in this paper are the so-called
uniform dependence algorithms defined as follows.

Definition 2.1 (Uniform dependence algorithm): A uniform dependence
algorithm is an algorithm that can be described by an equation of the
form

$$v(\bar{j}) = g_{\bar{j}}\bigl(v(\bar{j} - \bar{d}_1),\, v(\bar{j} - \bar{d}_2),\, \ldots,\, v(\bar{j} - \bar{d}_m)\bigr) \qquad (2.1)$$

where

1) $\bar{j} \in J \subset Z^n$ is an index point (a column vector), $J$
is the index set of the algorithm, and $n \in N^+$ is the number of
components of $\bar{j}$;

2) $g_{\bar{j}}$ is the computation indexed by $\bar{j}$, i.e., a
single-valued function computed "at point $\bar{j}$" in a single unit
of time;

3) $v(\bar{j})$ is the value computed "at $\bar{j}$", i.e., the result
of computing the right-hand side of (2.1); and

4) $\bar{d}_i \in Z^n$, $i = 1, \ldots, m$, $m \in N$ are dependence
vectors, also called dependencies, which are constant (i.e.,
independent of $\bar{j} \in J$); the matrix
$D = [\bar{d}_1, \ldots, \bar{d}_m]$ is called the dependence matrix.

The class of uniform dependence algorithms is a simple
extension of the class of computations described by uniform
recurrence equations [1]. The main difference is that uniform
dependence algorithms allow for different functions to
be  computed  (in  a  unit  of  time)  at  different  points  of  the 
index  set.  From  a  practical  viewpoint,  uniform  dependence 
algorithms  can  be  easily  related  to  programs  where  1)  a  single 
statement  appears  in  the  body  of  a  multiply-nested  loop  and 
2)  the  indexes  of  the  variable  in  the  left-hand  side  of  the 
statement  differ  by  a  constant  from  the  corresponding  indexes 
in  each  reference  to  the  same  variable  in  the  right-hand  side. 
Alternative  computations  can  occur  in  each  iteration  as  a  result 
of  a  single  conditional  statement  as  long  as  data  dependencies 
do  not  change.  Nested  loop  programs  with  multiple  statements 
can  also  use  the  techniques  of  this  paper  together  with  the 
alignment  method  discussed  in  [24]  and  [25]. 
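
As a minimal sketch of the loop-program view described above, the following Python fragment runs a doubly nested loop whose body is a single statement and whose right-hand-side references differ from the left-hand side by constant offsets. The function `g`, the loop bounds, the offsets, and the choice to treat values outside the index set as zero inputs are all illustrative assumptions, not taken from the paper.

```python
# Hypothetical single-statement loop nest:
#     v[j1, j2] = g(v[j1 - 1, j2], v[j1, j2 - 2])
# The index offsets are constant, so the dependence vectors are
# d1 = (1, 0) and d2 = (0, 2); they form the columns of D.

def g(a, b):
    # Placeholder computation; any single-valued function would do.
    return a + b + 1

def run(n1, n2):
    deps = [(1, 0), (0, 2)]  # dependence vectors (columns of D)
    v = {}
    for j1 in range(n1):
        for j2 in range(n2):
            # Points outside the index set are treated as inputs (assumed 0).
            args = [v.get((j1 - d[0], j2 - d[1]), 0) for d in deps]
            v[(j1, j2)] = g(*args)
    return v, deps

v, deps = run(3, 4)
```

Only the loop bounds (the index set) and `deps` (the dependence matrix) matter for scheduling; the body of `g` does not.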

For  the  purpose  of  this  paper,  only  structural  information  of 
the  algorithm,  i.e.,  the  index  set  J  and  the  dependence  matrix 
D,  is  needed.  Other  information,  such  as  what  computations 
occur  at  different  points  and  where  and  when  input/output  of 
variables  takes  place,  can  be  ignored.  Therefore,  a  uniform 
dependence  algorithm  with  index  set  J  and  dependence  matrix 
D  is  hereon  characterized  simply  by  the  pair  (J,  D).  Herein,  it 
is  assumed  that,  as  in  Definition  2.1,  the  letters  n  and  m  always 
denote  the  dimension  of  index  points  in  J  and  the  number  of 
dependence vectors, respectively. While no restrictions apply
to the type of the index set of an algorithm, the following is
assumed in regard to how index sets are described.

Assumption 2.1 (Index set, index constraint matrix, size
vector, and convex hull of the index set): The index set $J \subset \mathbb{Z}^n$
of any given algorithm is described as a set of integer points
(vectors) whose convex hull $R$ [14, p. 35] is a nondegenerate
(explained in the following paragraph) polyhedron, i.e.,

$$R = \{x : Ax \le b,\ A \in \mathbb{Z}^{a \times n},\ b \in \mathbb{Z}^a,\ x \in \mathbb{R}^n,\ a \in \mathbb{N}^+\} \qquad (2.2)$$

and

$$J \subseteq \{j : j \in R \wedge j \in \mathbb{Z}^n\}.$$

Matrix $A$ is called the index constraint matrix of $J$; vector $b$ is
called the size vector; $R$ is called the convex hull of index set $J$.


SHANG AND FORTES: TIME OPTIMAL SCHEDULES




Equation (2.2) simply states that $R$ is a polyhedron. If $R$
contains $n + 1$ points $x_1, x_2, \ldots, x_{n+1}$ such that $x_1 - x_{n+1}$,
$x_2 - x_{n+1}, \ldots, x_n - x_{n+1}$ are linearly independent, then $R$ is
nondegenerate. For example, a line and a square with nonzero
area are cases of degenerate and nondegenerate polyhedra
in $\mathbb{R}^2$, respectively. Clearly, degenerate polyhedra in $\mathbb{R}^n$ can
be reexpressed as nondegenerate polyhedra in $\mathbb{R}^{n'}$ for some
$n' < n$. According to Assumption 2.1, $R$ is not necessarily
bounded and $J$ is not necessarily dense ($J$ is not dense if
there is at least one integer point in $R$ that does not belong
to $J$). Extreme points of polyhedron $R$ are always integers
and belong to the index set $J$ because $R$ is the convex hull
of index set $J$. Finally, different size vectors correspond to
instances of the same algorithm that differ only in their sizes
(but have the same shape). The following example illustrates
the concepts introduced in Definition 2.1 and Assumption 2.1.

Example 2.1: Consider the following uniform dependence
algorithm:

$$v(j_1, j_2) = g(v(j_1 - 2, j_2 - 2), v(j_1, j_2 - 3))$$

where

$$j_1 = 0, \ldots, 2s, \qquad j_2 = \begin{cases} 0, \ldots, 2s - j_1 & \text{if } s < j_1 \le 2s \\ s - j_1, \ldots, 2s - j_1 & \text{if } 0 \le j_1 \le s. \end{cases}$$

The index set $J$ of this algorithm is shown in Fig. 1. The reader
can verify that it can be described as $J = \{j : Aj \le b,\ j \in \mathbb{Z}^2\}$
and its convex hull is $R = \{x = [x_1, x_2]^T : Ax \le b,\ x \in \mathbb{R}^2\}$, where

$$A = \begin{bmatrix} -1 & 0 \\ 0 & -1 \\ 1 & 1 \\ -1 & -1 \end{bmatrix}, \qquad b = \begin{bmatrix} 0 \\ 0 \\ 2s \\ -s \end{bmatrix}.$$

There are four extreme points, i.e., $e_1 = [0, 2s]^T$, $e_2 = [0, s]^T$,
$e_3 = [s, 0]^T$, and $e_4 = [2s, 0]^T$. The index set $J$
in this example is bounded, dense, and nondegenerate. The
dependence matrix is $D = [d_1, d_2]$ where $d_1 = [2, 2]^T$ and
$d_2 = [0, 3]^T$. End of example.

Fig. 1. The index set of the algorithm of Example 2.1. Each point corresponds
to a computation. The extreme points $e_1 = [0, 2s]^T$, $e_2 = [0, s]^T$,
$e_3 = [s, 0]^T$, $e_4 = [2s, 0]^T$ and projection vectors are marked in the figure.

In the example above, the index set $J$ is bounded by four
lines described by equations $x_1 = 0$, $x_2 = 0$, $x_1 + x_2 = s$,
and $x_1 + x_2 = 2s$, respectively. Each line is called a boundary
surface of index set $J$ and the definitions of this and other
ancillary concepts are formally stated next.

Definition 2.2 (Boundary surface, norm vector, norm space,
and norm intersection): Let $A \in \mathbb{Z}^{a \times n}$ be the index constraint
matrix of algorithm $(J, D)$ and let $A_{h_1}, \ldots, A_{h_k} \in A$,
$k \le \min\{a, n\}$, be linearly independent rows of $A$. The hyperplane
$F = \{x : A_{h_i} x = b_{h_i},\ x \in \mathbb{R}^n,\ i = 1, \ldots, k\}$ is called an
$(n - k)$-dimensional boundary surface of $J$ if $F \cap R \ne \emptyset$.
The row vectors $A_{h_1}, \ldots, A_{h_k}$ are called the norm vectors of
boundary surface $F$, the notation $\mathrm{surf}\{A_{h_1}, \ldots, A_{h_k}\}$ is used
to represent $F$, and $\mathrm{sp}\{A_{h_1}, \ldots, A_{h_k}\}$ is called the norm space
(of index set $J$) associated with $F$. The intersection of any two
norm spaces of $J$ is called a norm intersection of index set $J$.

When $k = n$ in the last definition, the corresponding surface
is "zero-dimensional," i.e., it is an extreme point of $J$.
The norm space spanned by norm vectors $A_{h_1}, \ldots, A_{h_k}$ is
"perpendicular" to its boundary surface $F$, i.e., if $V$ is a row
vector in $\mathrm{sp}\{A_{h_1}, \ldots, A_{h_k}\}$, then for all $x_1, x_2$ in
$F = \{x : A_{h_i} x = b_{h_i},\ x \in \mathbb{R}^n,\ i = 1, \ldots, k\}$, $V(x_1 - x_2) = 0$. For
a given index set $J$ with index constraint matrix $A \in \mathbb{Z}^{a \times n}$,
there are at most $\sum_{k=1}^{\min\{a,n\}} \binom{a}{k}$ distinct norm spaces.

Example 2.2: In Example 2.1, $A_1 = [-1, 0]$, $A_2 = [0, -1]$,
$A_3 = [1, 1]$, $A_4 = [-1, -1]$, $b_1 = 0$, $b_2 = 0$, and $b_3 = 2s$.
As shown in Fig. 1, $F_1 = \{x : A_1 x = b_1,\ x \in \mathbb{R}^2\} =
\{x : x_1 = 0,\ x \in \mathbb{R}^2\}$ is the one-dimensional boundary
surface of $J$ containing the points on the $x_2$ axis. Also,
$F_4 = \{x : A_1 x = b_1,\ A_3 x = b_3,\ x \in \mathbb{R}^2\} = \{x : x_1 = 0,\
x_1 + x_2 = 2s,\ x \in \mathbb{R}^2\}$ is a "zero"-dimensional surface of
$J$ that corresponds to the extreme point $e_1 = [0, 2s]^T$. Notice
that the set $F = \{x : A_1 x = b_1,\ A_2 x = b_2,\ x \in \mathbb{R}^2\} = \{x :
x_1 = x_2 = 0\}$ is not a boundary surface because $F \cap R = \emptyset$.
$F = \{x : A_3 x = b_3,\ A_4 x = b_4,\ x \in \mathbb{R}^2\}$ is not a boundary
surface because $A_3$ and $A_4$ are linearly dependent. One norm
intersection of $J$ is the intersection $\mathrm{sp}\{A_1\} \cap \mathrm{sp}\{A_1, A_3\} =
\mathrm{sp}\{A_1\} = \{X = [x_1, x_2] : x_2 = 0,\ X \in \mathbb{R}^{1 \times 2}\}$ where $A_1$,
$A_3$, and $X$ are row vectors, i.e., the intersection of the norm
space of $F_1$ and the norm space of $F_4$. End of example.
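
The index set and boundary surfaces of Examples 2.1 and 2.2 can be checked by brute force. The sketch below assumes the constraint matrix rows $A_1, \ldots, A_4$ quoted in Example 2.2 and a size vector $b = [0, 0, 2s, -s]^T$, where the last entry is inferred from the boundary line $x_1 + x_2 = s$; the helper names are hypothetical.

```python
from itertools import product

def index_set(s):
    # Integer points with A x <= b; b[3] = -s is inferred from x1 + x2 >= s.
    A = [(-1, 0), (0, -1), (1, 1), (-1, -1)]
    b = [0, 0, 2 * s, -s]
    pts = [j for j in product(range(2 * s + 1), repeat=2)
           if all(a[0] * j[0] + a[1] * j[1] <= bi for a, bi in zip(A, b))]
    return A, b, pts

def on_surface(A, b, rows, x):
    # True if A_h x = b_h for every selected row h (0-based row indexes).
    return all(A[h][0] * x[0] + A[h][1] * x[1] == b[h] for h in rows)

s = 5
A, b, J = index_set(s)
extreme = [(0, 2 * s), (0, s), (s, 0), (2 * s, 0)]

# F4 = surf{A1, A3} is the single point e1 = (0, 2s).
f4_points = [x for x in J if on_surface(A, b, [0, 2], x)]
# surf{A1, A2} would be the origin, which lies outside the index set.
origin_points = [x for x in J if on_surface(A, b, [0, 1], x)]
```

For s = 5 the enumeration yields 51 index points, all four extreme points belong to J, and the candidate surface defined by rows 1 and 2 contains no index point, in line with Example 2.2.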

Definition 2.3 (Schedules): A schedule for algorithm $(J, D)$
is a function $\sigma : J \to \mathbb{Z}$ such that for any arbitrary index
points $j, j' \in J$, $\sigma(j) < \sigma(j')$ if $j' = j + d_i$, $d_i \in D$.

In other words, a schedule is a mapping which assigns a time
of execution to each computation of the algorithm in such a
way that dependencies are preserved, i.e., if the computation
indexed by $j'$ depends on the computation indexed by $j$, then
the computation indexed by $j'$ can be executed only after the
execution of the computation indexed by $j$. The schedules of
interest in this paper are so-called linear schedules, which are
defined next.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 6, JUNE 1991

Definition 2.4: For algorithm $(J, D)$ a schedule $\sigma_\Pi : J \to \mathbb{Z}$
is a linear schedule if $\sigma_\Pi(j) = \lfloor \Pi j + c \rfloor$,¹ $j \in J$, where
$\Pi \in \mathbb{Q}^{1 \times n}$ is such that $\min\{\Pi d_i : d_i \in D\} = 1$ and
$c = -\min\{\Pi j : j \in J\}$.

Definitions 2.3 and 2.4 are similar to equivalent concepts in
[1]. In the last definition, the entries of $\Pi$ can be any rational
number. However, given such a $\Pi$, it is always possible to
find $\Pi' \in \mathbb{Z}^{1 \times n}$ and a constant $\mathrm{disp}\,\Pi' \in \mathbb{N}^+$ such that
$\Pi = \Pi' / \mathrm{disp}\,\Pi'$, where $\mathrm{disp}\,\Pi'$ equals the least common multiple
of the denominators of the entries of $\Pi$. For this reason, the
following definition of a linear schedule is equivalent to that
of Definition 2.4 and is used throughout this paper.

Definition 2.5 (Linear schedule and linear schedule vector):
For algorithm $(J, D)$ a linear schedule is a function $\sigma_\Pi : J \to \mathbb{N}$
such that

$$\sigma_\Pi(j) = \lfloor (\Pi j + c) / \mathrm{disp}\,\Pi \rfloor, \quad j \in J \qquad (2.3)$$

where $\Pi = [\pi_1, \ldots, \pi_n] \in \mathbb{Z}^{1 \times n}$, $\mathrm{disp}\,\Pi = \min\{\Pi d_i : d_i \in D\} > 0$,
$\gcd(\pi_1, \ldots, \pi_n)$² $= 1$, and $c = -\min\{\Pi j : j \in J\}$. The row
vector $\Pi$ is called the linear schedule vector associated with $\sigma_\Pi$.

Example 2.3: Fig. 2 depicts a linear schedule $\sigma_\Pi =
\lfloor ([1, 1] j - 5)/3 \rfloor$ for the algorithm of Example 2.1 ($s = 5$)
where $\Pi = [1, 1]$, $c = -5$, and $\mathrm{disp}\,\Pi = 3$. The hyperplanes
(dotted lines in this case) described by the equations $\Pi j = d$
for different values of $d$ help the visualization of the schedule
because all points contained in the same line are "executed"
at the same time. Actually, for any given group of three
consecutive lines (as identified in Fig. 2), all points on these
lines are executed simultaneously. In general, the number of
consecutive lines to execute simultaneously corresponds to the
value of $\mathrm{disp}\,\Pi$. Notice that all dependencies are "satisfied"
and it can be verified that the total execution time achieved
by the linear schedule is the same as the one achieved by the
free schedule. End of example.
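
The following sketch recomputes the schedule of Example 2.3 and checks that it respects both dependence vectors. It also shows the passage from a rational vector (Definition 2.4) to an integer one (Definition 2.5) by multiplying with the least common multiple of the denominators. The function names and the brute-force representation of J are illustrative assumptions.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def to_integer_vector(pi_rational):
    # Multiply by the lcm of the denominators (Definition 2.4 -> 2.5).
    lcm = reduce(lambda a, b: a * b // gcd(a, b),
                 (x.denominator for x in pi_rational), 1)
    return tuple(int(x * lcm) for x in pi_rational)

def linear_schedule(pi, J, D):
    disp = min(dot(pi, d) for d in D)      # disp(Pi); assumed positive
    c = -min(dot(pi, j) for j in J)
    # Floor of (Pi j + c) / disp, as in (2.3).
    sigma = {j: (dot(pi, j) + c) // disp for j in J}
    return sigma, c, disp

def respects_dependencies(sigma, D):
    # sigma(j) < sigma(j + d) whenever j + d is also an index point.
    for j in sigma:
        for d in D:
            succ = (j[0] + d[0], j[1] + d[1])
            if succ in sigma and not sigma[j] < sigma[succ]:
                return False
    return True

s = 5
J = [(j1, j2) for j1 in range(2 * s + 1) for j2 in range(2 * s + 1)
     if s <= j1 + j2 <= 2 * s]
D = [(2, 2), (0, 3)]
pi = to_integer_vector((Fraction(1, 3), Fraction(1, 3)))
sigma, c, disp = linear_schedule(pi, J, D)
total_time = max(sigma.values()) - min(sigma.values()) + 1
```

With s = 5 this reproduces the values quoted in the example: Π = [1, 1], c = -5, and disp Π = 3.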

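The total execution time of algorithm $(J, D)$ with a linear
schedule $\sigma_\Pi$ is

$$t(\Pi) = \max\{\sigma_\Pi(j) : j \in J\} - \min\{\sigma_\Pi(j) : j \in J\} + 1
= \left\lceil \frac{\max\{\Pi(j_1 - j_2) : j_1, j_2 \in J\} + 1}{\min\{\Pi d_i : d_i \in D\}} \right\rceil \qquad (2.3a)$$

$$= \left\lfloor \frac{\max\{\Pi(j_1 - j_2) : j_1, j_2 \in J\}}{\min\{\Pi d_i : d_i \in D\}} \right\rfloor + 1. \qquad (2.4)$$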
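
The two expressions for the total execution time agree if, as the surrounding text and Example 3.1 suggest, (2.3a) is read as a ceiling and (2.4) as a floor. As a short check, write $M = \max\{\Pi(j_1 - j_2) : j_1, j_2 \in J\}$ and $d = \min\{\Pi d_i : d_i \in D\}$ (shorthand introduced here only for illustration), both integers with $M \ge 0$ and $d > 0$, and let $M = qd + r$ with $0 \le r < d$. Then

$$\left\lceil \frac{M + 1}{d} \right\rceil = q + \left\lceil \frac{r + 1}{d} \right\rceil = q + 1 = \left\lfloor \frac{M}{d} \right\rfloor + 1, \qquad \text{since } 1 \le r + 1 \le d.$$

For the schedule of Example 2.3 ($\Pi = [1, 1]$, $s = 5$), $M = 10 - 5 = 5$ and $d = 3$, so $t = \lfloor 5/3 \rfloor + 1 = 2$.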


Because  t  is  minimum  if  the  argument  of  the  floor  function 
in  (2.4)  is  minimized,  the  problem  of  minimizing  the  total 
execution  time  t  can  be  stated  as  follows. 

Problem 2.1 (Time optimal linear schedule problem): For
algorithm $(J, D)$ the time optimal linear schedule problem
consists of finding a linear schedule vector $\Pi^\circ \in \mathbb{Q}^{1 \times n}$ such
that it minimizes

$$\frac{\max\{\Pi(j_1 - j_2) : j_1, j_2 \in J\}}{\min\{\Pi d_i : d_i \in D\}} \quad \text{subject to } \Pi d_i > 0,\ d_i \in D. \qquad (2.5)$$

¹ $\lfloor a \rfloor$ is the greatest integer that is less than or equal to $a$.
² $\gcd(a_1, \ldots, a_n)$ is the greatest common divisor of $a_1, \ldots, a_n$.

Fig. 2. The execution order by a linear schedule with $\Pi = [1, 1]$ for the
algorithm of Example 2.1 when $s = 5$. Each line corresponds to a hyperplane
with normal vector $\Pi$. All points on three consecutive lines can be scheduled
to execute at the same time step without violating the dependency relations.


Let $S$ denote the solution space of Problem 2.1, i.e., $S =
\{\Pi : \Pi D > \vec{0}\}$. The set $S$ includes all linear schedule vectors
$\Pi$ defined in Definition 2.5. Notice that if $\Pi^\circ$ is an optimal
solution of Problem 2.1, so is $\alpha \Pi^\circ$ for any positive constant
$\alpha$. This guarantees that the optimal solution $\Pi^\circ$ can always
be obtained such that $\Pi^\circ \in \mathbb{Z}^{1 \times n}$ and the greatest common
divisor of the components of $\Pi^\circ$ is equal to one. After $\Pi^\circ$
is found, then, according to Definition 2.5, the constant $c$ can
be determined and the corresponding optimal linear schedule
$\sigma_{\Pi^\circ}$ can be specified completely.

As in [1], the linear schedules considered in this paper
satisfy Definition 2.4 (or, equivalently, Definition 2.5). In
other words, they are described by (2.3) and the condition
that $\mathrm{disp}\,\Pi = \min\{\Pi d_i : d_i \in D\} > 0$. Schedules for
which $\mathrm{disp}\,\Pi < \min\{\Pi d_i : d_i \in D\}$ are easily transformed
into equivalent schedules for which Definition 2.5 is satisfied.
However, there are valid schedules described by (2.3) for
which $\mathrm{disp}\,\Pi > \min\{\Pi d_i : d_i \in D\} > 0$ that are not considered
in this paper. As an example, for algorithm $(J, D)$
where $n = m = 1$, $J = \{0, 1, 3\}$, and $D = [2]$, the mapping
$\sigma(j) = \lfloor ([1] j)/3 \rfloor$, $j = 0, 1, 3$, is a feasible schedule (i.e., it
respects the dependency). Because the denominator $\mathrm{disp}\,\Pi = 3$
is greater than $\min\{\Pi d_i : d_i \in D\} = 1 \cdot 2 = 2$, this schedule
is not a linear schedule (according to Definitions 2.4 and 2.5).
However, it is easy to prove that for a mapping $\sigma_\Pi$ described
by (2.3), if there exist two index points $j', j \in J$ such
that $(\Pi j)/\mathrm{disp}\,\Pi$ is an integer and $j + d_i = j'$, $d_i \in D$,
then $\sigma_\Pi$ is a feasible schedule if and only if it meets the
conditions in Definition 2.5, i.e., $\mathrm{disp}\,\Pi = \min\{\Pi d_i : d_i \in D\}$.
For example, for algorithm $(\{0, 1, 2, 3\}, [2])$, mapping
$\sigma(j) = \lfloor ([1] j)/3 \rfloor$, $j = 0, 1, 2, 3$, is not a schedule because
$\sigma(0) = \sigma(2)$ and the computation indexed by point 2 depends
on the one indexed by 0. So, in general, for algorithms with
reasonably large and dense index sets (the prevalent case for
algorithms of practical interest), Definition 2.5 includes all
schedules described by mappings of the form of (2.3). In fact,
because loop bounds are often unknown at compile time, it
is desirable to consider only those schedules that are valid
regardless of the size of the algorithm. Also, for large and not
necessarily dense algorithms which do not satisfy the condition
mentioned above, it is often possible to do simple algorithm
transformations that yield an equivalent algorithm for which
the condition is satisfied. In this paper, only linear schedules,
as defined by Definitions 2.4 and 2.5, are considered.
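
The two one-dimensional cases discussed above are small enough to check directly. The sketch below tests the mapping $\lfloor j/3 \rfloor$ against the single dependence $d = 2$ on the sparse index set {0, 1, 3} and on the dense index set {0, 1, 2, 3}; the helper name `is_schedule` is hypothetical.

```python
def is_schedule(sigma, J, D):
    # A mapping is a schedule if sigma(j) < sigma(j + d) for every
    # dependence d whenever both j and j + d are index points.
    pts = set(J)
    return all(sigma(j) < sigma(j + d)
               for j in pts for d in D if j + d in pts)

def sigma(j):
    # The mapping floor(([1] j) / 3) from the text.
    return (1 * j) // 3

sparse_ok = is_schedule(sigma, [0, 1, 3], [2])     # only the pair 1 -> 3 exists
dense_ok = is_schedule(sigma, [0, 1, 2, 3], [2])   # the pair 0 -> 2 collides
```

The sparse case passes only because the colliding pair (0, 2) is absent from J, which is why such schedules depend on the particular index set and are excluded from the linear schedules considered here.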

The solution of Problem 2.1 is discussed in Section III.
The remainder of this section contains two more definitions
which are necessary for discussions in the rest of this paper.

When $J$ is bounded, there always exist extreme points $e_i$,
$e_k$ of $J$ such that $\max\{\Pi(j_1 - j_2) : j_1, j_2 \in J\} = \Pi e_i - \Pi e_k$.
To characterize this, projection vectors are defined next.

Definition 2.6 (Projection vectors): A column vector $p$ is a
projection vector if it is the difference of two extreme points
$e_i$, $e_k$ of $R$ (i.e., $p = e_i - e_k$). Hereon, the symbol $q$ is used
to denote the number of projection vectors and the set of all
the projection vectors is denoted $P$, i.e., $P = \{p_1, \ldots, p_q\}$.

Example 2.4: For the algorithm of Example 2.1, its solution
space is shown in Fig. 3. As can be seen from Fig. 1,
there are 12 projection vectors: $p_1 = e_1 - e_2$, $p_2 = e_1 - e_3$,
$p_3 = e_1 - e_4$, $p_4 = e_2 - e_1$, $p_5 = e_2 - e_3$, $p_6 = e_2 - e_4$,
$p_7 = e_3 - e_1$, $p_8 = e_3 - e_2$, $p_9 = e_3 - e_4$, $p_{10} = e_4 - e_1$,
$p_{11} = e_4 - e_2$, and $p_{12} = e_4 - e_3$. End of example.

Definition 2.7 (Minimum dependence vector and maximum
projection vector): For a given algorithm $(J, D)$ and a linear
schedule vector $\Pi$, a dependence vector $d$ is called a minimum
dependence vector of $\Pi$ if $\Pi d = \min\{\Pi d_k : d_k \in D\} = \mathrm{disp}\,\Pi$
and a projection vector $p$ is called a maximum projection vector
of $\Pi$ if $\Pi p = \max\{\Pi p_i : p_i \in P\}$.

Example 2.5: Consider the algorithm of Example 2.1. The
minimum dependence vector of $\Pi = [0, 1]$ is $d_1$ because
$\Pi d_1 = \min\{\Pi d_1, \Pi d_2\} = \min\{2, 3\} = 2$ and one of its
maximum projection vectors is $p_3 = e_1 - e_4$. End of example.
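
Definitions 2.6 and 2.7 translate directly into a few lines of code. The sketch below builds the 12 projection vectors of Example 2.4 from the four extreme points and evaluates the minimum dependence vector and maximum projection vectors for $\Pi = [0, 1]$. The value s = 5 and all function names are assumptions made for illustration.

```python
from itertools import permutations

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

s = 5
E = {1: (0, 2 * s), 2: (0, s), 3: (s, 0), 4: (2 * s, 0)}   # extreme points e1..e4
D = {1: (2, 2), 2: (0, 3)}                                 # dependence vectors d1, d2

# Projection vectors e_i - e_k for every ordered pair of distinct extreme points.
P = {(i, k): (E[i][0] - E[k][0], E[i][1] - E[k][1])
     for i, k in permutations(E, 2)}

def minimum_dependence(pi):
    disp = min(dot(pi, d) for d in D.values())
    return disp, [i for i, d in D.items() if dot(pi, d) == disp]

def maximum_projection(pi):
    best = max(dot(pi, p) for p in P.values())
    return best, [ik for ik, p in P.items() if dot(pi, p) == best]

disp, min_deps = minimum_dependence((0, 1))
best, max_projs = maximum_projection((0, 1))
```

Here `max_projs` lists index pairs (i, k) meaning $e_i - e_k$; the pair (1, 4) corresponds to $p_3 = e_1 - e_4$ of Example 2.5, and (1, 3) attains the same maximum, which is why the example says "one of".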

III. Solution of the Time Optimal Linear
Schedule Problem: The Bounded Index Set Case

This section considers Problem 2.1 when the index set $J$
of the algorithm under consideration is bounded, i.e., the
cardinality of $J$ is finite, and proposes a procedure to solve
this problem. While the derivation and proofs of correctness
of this method are somewhat complex and long, the procedure
itself is relatively simple and easy to illustrate by examples.
Therefore, in Section III-A, the optimization procedure is
first presented and illustrated by four examples. The formal
derivation and proofs of correctness of the procedure are
discussed in Section III-B and its complexity is analyzed in
Section III-C.


A.  Optimization  Procedure 


Generally speaking, Problem 2.1 is a nonlinear programming
problem (in the sense that the objective function is
nonlinear) with linear constraints. The solution space $S = \{\Pi :
\Pi d_i > 0,\ d_i \in D\}$ contains all feasible linear schedule vectors
and its cardinality is infinite. The optimization procedure
proposed in this section constructs a candidate set $C^\circ$ which
is a finite subset of $S$ and contains the optimal solution of
Problem 2.1. Intuitively, the solution space $S$ is a convex
cone and it can be partitioned into a finite number of convex
subcones so that the objective function in (2.5) becomes a
linear fractional function in each subcone. What the optimization
procedure does is to include all extreme directions of
each subcone in the candidate set $C^\circ$. By the linear fractional
programming theorem [14, Section 11.4], one of the extreme
directions is the optimal solution of Problem 2.1.

In general, the optimal linear schedule vector $\Pi^\circ$ is a
function of the size vector $b$ (see Example 3.4), i.e., the
optimal linear schedule vector $\Pi^\circ$ for algorithm $(J, D)$ with
a size vector $b$ may be different from the one for the same
algorithm $(J, D)$ with a different size vector $b'$. However,
for all possible different size vectors of the same algorithm
$(J, D)$, the corresponding optimal linear schedule vectors
belong to the candidate set $C^\circ$. In this sense, the candidate set
$C^\circ$ is sufficient. For some algorithms (e.g., Example 3.4), it
is also necessary in order to provide optimal linear schedule
vectors for all possible different size vectors of the algorithm.
In other words, for these algorithms, every candidate in $C^\circ$
might be an optimal solution for a certain given size vector $b$
of the algorithm. For a programmed algorithm, the candidate
set $C^\circ$ can be constructed at compile-time. The optimal linear
schedule vector $\Pi^\circ$ can be identified either at compile-time or
run-time by enumeration of all candidates in $C^\circ$, depending
on whether or not the size vector $b$ is known at compile-time.

As an informal description of the procedure, each candidate
$\Pi \in C^\circ$ is determined by a pair consisting of a $k$-dimensional
norm intersection of the index set $J$ and a set of $k$ linearly
independent dependence vectors, $1 \le k \le \mathrm{rank}(D)$. Consider
a $k$-dimensional norm intersection $\mathcal{B}$ and a set of $k$
linearly independent dependence vectors $d_t, d_{t_1}, \ldots, d_{t_{k-1}}$. Let
$\{B_1, \ldots, B_k\}$ be a basis for $\mathcal{B}$ (in most cases, $B_i$, $i = 1, \ldots, k$,
is a row vector of the index constraint matrix $A$), let

$$B = \begin{bmatrix} B_1 \\ \vdots \\ B_k \end{bmatrix}$$

and let $H = [d_t - d_{t_1}, d_t - d_{t_2}, \ldots, d_t - d_{t_{k-1}}]$. A row
vector $\Pi$ is obtained as follows:

$$\Pi = \beta(\alpha_1 B_1 + \cdots + \alpha_k B_k) \qquad (3.1)$$

where $\alpha_1, \ldots, \alpha_k$ satisfy the equation

$$[\alpha_1, \ldots, \alpha_k] B H = \vec{0} \qquad (3.2)$$

and $\beta$ is a constant to be determined according to condition
2) below. For each vector $\Pi$, the following three conditions
are tested:

1) $\mathrm{rank}(BH) = k - 1$.

2) There exists a proper value of $\beta$ such that $\Pi d_i > 0$,
$d_i \in D$, and $\Pi$ satisfies Definition 2.5.

3) The dependence vectors $d_t, d_{t_1}, \ldots, d_{t_{k-1}}$ are minimum
dependence vectors of $\Pi$.

If any one of the three conditions is not met, then this pair
of norm intersection and dependence vectors is rejected and
the row vector $\Pi$ is ignored (i.e., these three conditions are
necessary for the optimality of $\Pi$). If all conditions are met,
then the row vector $\Pi$ is a candidate and is included in $C^\circ$. The
procedure considers every combination of each $k$-dimensional
norm intersection with every possible distinct set of $k$ linearly
independent dependence vectors, $1 \le k \le \mathrm{rank}(D)$, to
construct the candidate set $C^\circ$. The formal description of the
procedure is as follows.

Fig. 3. The dotted lines specify the solution space $S$ of Problem 2.1 for the algorithm of Example 2.1, where $S = \{\Pi : \Pi D > \vec{0}\} = \{\Pi : \pi_1 + \pi_2 > 0,\ \pi_2 > 0\}$.
The solid lines specify four convex subcones into which $S$ is partitioned. For $S_{ik}$, $i = 1, 2$, $k = 2, 3, 11$, $d_i$ is minimum and $p_k$ is maximum.
Projection vectors are shown in Fig. 1.

Procedure 3.1 (Construction of candidate set $C^\circ$):

Input: The index constraint matrix $A$ and the dependence
matrix $D$ of algorithm $(J, D)$ where index set $J$ is
bounded.

Output: A finite set of candidates $C^\circ$ which, for any arbitrary
size vector $b$, contains the optimal solution of
Problem 2.1.

Step 1: Find all distinct norm intersections of $J$ and set
$k = 1$, $l = 1$.

Step 2: $C_k = \emptyset$.

Step 3: Pick a $k$-dimensional norm intersection of $J$ and find
all distinct sets of $k$ linearly independent dependence
vectors.

Step 4: Pick a set of $k$ linearly independent dependence
vectors.

Step 5: Suppose that the $k$ dependence vectors being processed
are $d_t, d_{t_1}, \ldots, d_{t_{k-1}}$ and a basis for the
$k$-dimensional norm intersection being processed is
$\{B_1, \ldots, B_k\}$. Let

$$B = \begin{bmatrix} B_1 \\ \vdots \\ B_k \end{bmatrix} \quad \text{and} \quad H = [d_t - d_{t_1}, d_t - d_{t_2}, \ldots, d_t - d_{t_{k-1}}].$$

If $\mathrm{rank}(BH) = k - 1$, then
$\Pi_l = \beta[\alpha_1, \ldots, \alpha_k]B$ where $\beta$ is a nonzero constant
and $[\alpha_1, \ldots, \alpha_k]$ is a nonzero solution of the equation
$[\alpha_1, \ldots, \alpha_k]BH = \vec{0}$. If $\mathrm{rank}(BH) < k - 1$, go to
Step 7.

Step 6: If there does not exist a constant $\beta$ such that $\Pi_l D > \vec{0}$,
$\Pi_l \in \mathbb{Z}^{1 \times n}$, and the greatest common divisor of
the components of $\Pi_l$ is equal to one, then go to
Step 7; otherwise, set the value for $\beta$ and if $d_t$,
$d_{t_1}, \ldots, d_{t_{k-1}}$ are minimum dependence vectors of
$\Pi_l$, then $C_k = C_k \cup \{\Pi_l\}$ and $l = l + 1$.

Step 7: Check if all distinct sets of $k$ linearly independent
dependence vectors have been processed. If not,
pick an unprocessed set of $k$ linearly independent
dependence vectors and go to Step 5.

Step 8: Check if all distinct $k$-dimensional norm intersections
have been processed. If not, pick an unprocessed
norm intersection and go to Step 4.

Step 9: Check the value of $k$. If $k < \mathrm{rank}(D)$, then $k =
k + 1$, go to Step 2.

Step 10: $C^\circ = \bigcup_{k=1}^{\mathrm{rank}(D)} C_k$. Stop.
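
The steps above can be sketched in Python for small instances. This is an illustrative reading of Procedure 3.1, not the authors' implementation: it assumes the norm intersections are supplied as lists of basis rows (here taken from Table I for Example 2.1), uses exact rational arithmetic, and realizes the choice of $\beta$ in Step 6 by scaling to a primitive integer vector and trying both signs. All function names are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from itertools import combinations
from math import gcd

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def rref(rows, ncols):
    # Reduced row echelon form over the rationals; returns (rows, pivot columns).
    m = [[Fraction(x) for x in r] for r in rows]
    piv = []
    for col in range(ncols):
        r0 = len(piv)
        pr = next((r for r in range(r0, len(m)) if m[r][col] != 0), None)
        if pr is None:
            continue
        m[r0], m[pr] = m[pr], m[r0]
        p = m[r0][col]
        m[r0] = [x / p for x in m[r0]]
        for r in range(len(m)):
            if r != r0:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[r0])]
        piv.append(col)
    return m[:len(piv)], piv

def null_vector(M, ncols):
    # Basis vector of {x : M x = 0} if that space is one-dimensional, else None.
    R, piv = rref(M, ncols)
    free = [c for c in range(ncols) if c not in piv]
    if len(free) != 1:
        return None
    x = [Fraction(0)] * ncols
    x[free[0]] = Fraction(1)
    for i, pc in enumerate(piv):
        x[pc] = -R[i][free[0]]
    return x

def primitive(vec):
    # Scale a nonzero rational vector to integers whose gcd is one.
    lcm = reduce(lambda a, b: a * b // gcd(a, b), (x.denominator for x in vec), 1)
    ints = [int(x * lcm) for x in vec]
    g = reduce(gcd, (abs(i) for i in ints))
    return [i // g for i in ints]

def candidates(intersections, D):
    n = len(D[0])
    rank_D = len(rref(D, n)[1])
    found = []
    for k in range(1, rank_D + 1):                       # Steps 2 and 9
        for B in (b for b in intersections if len(b) == k):   # Steps 3 and 8
            for ds in combinations(D, k):                # Steps 4 and 7
                if len(rref(ds, n)[1]) < k:
                    continue                             # not linearly independent
                # Step 5: columns of H are d_t - d_{t_i}; build (BH)^T.
                H = [[a - b for a, b in zip(ds[0], d)] for d in ds[1:]]
                BHt = [[dot(h, Bi) for Bi in B] for h in H]
                alpha = null_vector(BHt, k)              # None unless rank(BH) = k - 1
                if alpha is None:
                    continue
                pi = primitive([sum(a * Bi[c] for a, Bi in zip(alpha, B))
                                for c in range(n)])
                for beta in (1, -1):                     # Step 6
                    cand = tuple(beta * x for x in pi)
                    disp = min(dot(cand, d) for d in D)
                    if (disp > 0 and all(dot(cand, d) == disp for d in ds)
                            and cand not in found):
                        found.append(cand)
    return found                                         # Step 10

intersections = [[(-1, 0)], [(0, -1)], [(1, 1)], [(-1, 0), (1, 1)]]
D = [(2, 2), (0, 3)]
C0 = candidates(intersections, D)
```

For the data of Example 2.1 the sketch returns the three vectors [0, 1], [1, 1], and [1, 2]. Because a direction $\alpha B$ contains exactly two integer vectors with unit greatest common divisor, trying $\beta = \pm 1$ after the primitive scaling covers every admissible $\beta$.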

To find all distinct norm intersections in Step 1, all norm
spaces have to be found first by considering all possible
combinations of $k$ rows of the index constraint matrix $A$, $k =
1, \ldots, \min\{a, n\}$. If the $k$ row vectors in a given combination
are linearly dependent, then that combination is ignored. If
they are not, the row vectors in this combination span a norm
space. According to Definition 2.2, each and every norm space
corresponds to a set of linearly independent rows of matrix
$A$. Therefore, all norm spaces are found in this way and can
be put in a set of norm spaces. Clearly, as mentioned before
Example 2.2, there are at most $\sum_{k=1}^{\min\{a,n\}} \binom{a}{k}$ distinct norm
spaces. When $k = n$, there is only one norm space, which is
the real space $\mathbb{R}^n$ and can be spanned by any set of $n$ linearly
independent rows of matrix $A$. Because two or more different
sets of $k$ linearly independent rows of matrix $A$ may span the
same norm space, some norm space may appear more than
once in the set of norm spaces. Of course, some techniques can
be applied to eliminate the repeated appearance of norm spaces
to keep their uniqueness in the set. However, this repeated
appearance in the set does not affect the correctness of the
final result of Procedure 3.1.

TABLE I
The Norm Spaces N1, ..., N4 and Norm Intersections of J of the Algorithm of Examples 2.1 and 3.1. Norm
Intersections are Obtained by Applying the Intersection Operation on Two Norm Spaces. As an Example, the Intersection
of Norm Spaces N1 and N4 is B1 = N1, Which is Indicated in the Second Column and the Last Row of the Table.

        N1 = sp{[-1,0]}   N2 = sp{[0,-1]}   N3 = sp{[1,1]}   N4 = sp{[-1,0],[1,1]}
N1      B1 = N1
N2      0                 B2 = N2
N3      0                 0                 B3 = N3
N4      B1 = N1           B2 = N2           B3 = N3          B4 = N4
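
The enumeration of norm spaces described above can be sketched as follows. The code takes every combination of k rows of A, discards linearly dependent combinations, and identifies repeated spans by their reduced row echelon form, which is one possible technique for removing duplicates and is an assumption of this sketch rather than something prescribed by the paper. Function names are hypothetical.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

def rref(rows):
    # Reduced row echelon form over the rationals, nonzero rows only.
    m = [[Fraction(x) for x in r] for r in rows]
    pivots = 0
    for col in range(len(m[0])):
        pr = next((r for r in range(pivots, len(m)) if m[r][col] != 0), None)
        if pr is None:
            continue
        m[pivots], m[pr] = m[pr], m[pivots]
        p = m[pivots][col]
        m[pivots] = [x / p for x in m[pivots]]
        for r in range(len(m)):
            if r != pivots and m[r][col] != 0:
                f = m[r][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[pivots])]
        pivots += 1
        if pivots == len(m):
            break
    return tuple(tuple(r) for r in m[:pivots])

def norm_spaces(A):
    # Map each distinct span (keyed by its RREF) to the first row set found.
    n = len(A[0])
    spaces = {}
    for k in range(1, min(len(A), n) + 1):
        for rows in combinations(range(len(A)), k):
            basis = rref([A[h] for h in rows])
            if len(basis) == k:          # the k rows are linearly independent
                spaces.setdefault(basis, rows)
    return spaces

A = [(-1, 0), (0, -1), (1, 1), (-1, -1)]
spaces = norm_spaces(A)
upper_bound = sum(comb(len(A), k) for k in range(1, min(len(A), len(A[0])) + 1))
```

For the constraint matrix of Example 2.1 this finds four distinct norm spaces (three lines and the whole plane), well below the combinatorial upper bound of 10.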

All norm intersections can be found by intersecting each
norm space with every other norm space. There are at most
$\left(\sum_{k=1}^{\min\{a,n\}} \binom{a}{k}\right)\left(\sum_{k=1}^{\min\{a,n\}} \binom{a}{k} - 1\right)$ norm intersections.
As explained later after Example 3.4, for many algorithms,
each and every norm intersection is equal to some norm space
and therefore, there is no need to put effort in finding norm
intersections. To find a basis for the norm intersection $\mathcal{B}$ of
two norm spaces $\mathrm{sp}\{A_{t_1}, \ldots, A_{t_{k_1}}\}$ and $\mathrm{sp}\{A_{h_1}, \ldots, A_{h_{k_2}}\}$
(needed in Step 5) requires the solution of the following
equation:

$$[\delta_1, \ldots, \delta_{k_1}, \delta_{k_1+1}, \ldots, \delta_{k_1+k_2}]
\begin{bmatrix} A_{t_1} \\ \vdots \\ A_{t_{k_1}} \\ A_{h_1} \\ \vdots \\ A_{h_{k_2}} \end{bmatrix} = \vec{0}. \qquad (3.3)$$

If there are $k$ linearly independent solutions $\Delta_i = [\delta_{i1},
\delta_{i2}, \ldots, \delta_{i(k_1+k_2)}]$, $i = 1, \ldots, k$, for (3.3), then the dimension
of $\mathcal{B}$ is $k$ and a basis for $\mathcal{B}$ is

$$[\delta_{i1}, \ldots, \delta_{ik_1}] \begin{bmatrix} A_{t_1} \\ \vdots \\ A_{t_{k_1}} \end{bmatrix}, \quad i = 1, \ldots, k.$$

The following examples are used to show how the candidate
set $C^\circ$ is constructed according to Procedure 3.1 and illustrate
different aspects of it. Detailed discussion of different aspects
of Procedure 3.1 is given after these examples.

Example 3.1: Procedure 3.1 is now applied to find the
optimal linear schedule vector $\Pi^\circ$ for the algorithm of
Example 2.1. There are 8 boundary surfaces, four of which
correspond to the four boundary lines of $J$ and the remaining
four correspond to the four extreme points of $J$. There are
four distinct norm spaces $N_i$, $i = 1, \ldots, 4$, and four distinct
norm intersections $B_i$, $i = 1, \ldots, 4$, which are listed in
Table I. The derivation of these norm intersections is shown
in Table I. Notice that the set of two linearly independent
rows $\{[-1, 0], [0, -1]\}$ can span a norm space which is
equal to $N_4$. That norm space is removed from the set of
norm spaces because it is known that there is only one
$n$-dimensional norm space for any algorithm. In Step 1, the
norm intersections are found, i.e., $B_1 = \mathrm{sp}\{[-1, 0]\}$, $B_2 =
\mathrm{sp}\{[0, -1]\}$, $B_3 = \mathrm{sp}\{[1, 1]\}$, and $B_4 = \mathrm{sp}\{[-1, 0], [1, 1]\}$,
and $k$ is assigned the value one. According to Steps 3 and
4, each one-dimensional norm intersection (i.e., $B_1$, $B_2$, and
$B_3$) is considered together with each of the dependence vectors
(i.e., $d_1$ and $d_2$). According to Step 5 and because for all pairs
$(B_i, d_l)$, $i = 1, 2, 3$, $l = 1, 2$, $H = \emptyset$, $\mathrm{rank}(BH) = 0 = k - 1$,
the following distinct vectors are generated: $\Pi_1 = \beta_1[-1, 0]$,
$\Pi_2 = \beta_2[0, -1]$, and $\Pi_3 = \beta_3[1, 1]$. For Step 6, $\Pi_1$ is not
feasible because $\Pi_1 d_2 = 0$ regardless of the value of $\beta_1$. If
$\beta_2 = -1$ and $\beta_3 = 1$, then $\Pi_2 D > \vec{0}$ and $\Pi_3 D > \vec{0}$, $\Pi_2$,
$\Pi_3 \in \mathbb{Z}^{1 \times 2}$, and the greatest common divisors of the entries
of both $\Pi_2$ and $\Pi_3$ are unity; $d_1$ is the minimum dependence
vector of $\Pi_2$ and $d_2$ is the minimum dependence vector of $\Pi_3$
(i.e., $\Pi_2$ and $\Pi_3$ correspond to the pairs $(B_2, d_1)$ and $(B_3, d_2)$,
respectively). So, $C_1 = \{\Pi_2, \Pi_3\}$. Let $k = 2$; there is only one
two-dimensional norm intersection ($B_4 = \mathbb{R}^2$) and only one
set of two linearly independent dependence vectors $\{d_1, d_2\}$.
According to Step 5,

$$B = \begin{bmatrix} -1 & 0 \\ 1 & 1 \end{bmatrix}, \qquad H = \begin{bmatrix} 2 \\ -1 \end{bmatrix},$$

and $\mathrm{rank}(BH) = 1 = k - 1$. By the equation $[\alpha_1, \alpha_2]BH = 0$,
$[\alpha_1, \alpha_2] = [1, 2]$ and the corresponding row vector is $\Pi_4 =
\beta_4[1, 2]$. Similarly, if $\beta_4 = 1$, conditions in Step 6 are satisfied
and $d_1$ and $d_2$ are minimum dependence vectors of $\Pi_4$.
Therefore, $C_2 = \{\Pi_4\}$ and $C^\circ = C_1 \cup C_2 = \{\Pi_2, \Pi_3, \Pi_4\}$.
$C^\circ$ is a finite subset of the solution space $S$ which is shown
in Fig. 3. The candidate with shortest execution time in $C^\circ$
is the optimal solution of Problem 2.1, i.e., the optimal linear
schedule vector for this algorithm. The total execution time
by each candidate in $C^\circ$ is evaluated according to (2.3a) as
follows:

$$t(\Pi_2) = \left\lceil \frac{[0, 1]([0, 2s]^T - [s, 0]^T) + 1}{2} \right\rceil = \left\lceil \frac{2s + 1}{2} \right\rceil$$

$$t(\Pi_3) = \left\lceil \frac{[1, 1]([2s, 0]^T - [s, 0]^T) + 1}{3} \right\rceil = \left\lceil \frac{s + 1}{3} \right\rceil$$

$$t(\Pi_4) = \left\lceil \frac{[1, 2]([0, 2s]^T - [s, 0]^T) + 1}{6} \right\rceil = \left\lceil \frac{3s + 1}{6} \right\rceil$$

Clearly, $\Pi_3$ is of the shortest execution time. Therefore, the
optimal linear schedule vector for this algorithm is $\Pi_3$, i.e.,
$\Pi^\circ = \Pi_3$. In this example, because there is only one variable
$s$ in the size vector $b$, the optimal linear schedule vector is
independent of the size vector. End of example.
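
The comparison of the three candidates can be reproduced by brute force over the index set. The sketch below assumes s = 5 and uses the floor form (2.4) of the total execution time; the dictionary keys and function names are hypothetical labels for the candidates $\Pi_2$, $\Pi_3$, and $\Pi_4$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def total_time(pi, J, D):
    # Floor form (2.4): floor(max{Pi(j1 - j2)} / min{Pi d}) + 1.
    disp = min(dot(pi, d) for d in D)
    values = [dot(pi, j) for j in J]
    return (max(values) - min(values)) // disp + 1

s = 5
J = [(j1, j2) for j1 in range(2 * s + 1) for j2 in range(2 * s + 1)
     if s <= j1 + j2 <= 2 * s]
D = [(2, 2), (0, 3)]
candidate_set = {"Pi2": (0, 1), "Pi3": (1, 1), "Pi4": (1, 2)}
times = {name: total_time(pi, J, D) for name, pi in candidate_set.items()}
best = min(times, key=times.get)
```

For s = 5 the computed times match the closed forms of the example, namely 6, 2, and 3 time steps for $\Pi_2$, $\Pi_3$, and $\Pi_4$, so the enumeration selects $\Pi_3$.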

Example 3.2: Consider the algorithm $(J, D)$ with $D =
[\bar d_1, \bar d_2, \bar d_3]$ where $\bar d_1 = [1, -3]^T$, $\bar d_2 = [2, 4]^T$, and $\bar d_3 =
[2, 0]^T$. The index set $J$ is the same as that of the algorithm of
Example 2.1 and is shown in Fig. 1. Because the index set of


730 


IEEE  TRANSACTIONS  ON  COMPUTERS,  VOL.  40,  NO.  6,  JUNE  1991 


TABLE II

Norm Spaces and Norm Intersections of Example 3.3. The Derivation of the Norm Intersections
is as Described in Table I. Notice That $B_i$, $i = 5, 6, 7$, Are Different From All the Norm Spaces.

[Table body illegible in source.]

this algorithm is the same as that of the algorithm of
Example 3.1, the norm intersections of $J$ are the same and
are shown in Table I. For $k = 1$, similarly to Example 3.1,
all pairs $(B_i, \bar d_l)$, $i, l = 1, 2, 3$, are considered. According
to Step 5, three distinct row vectors are generated: $\Pi_1 =
\beta_1[-1, 0]$, $\Pi_2 = \beta_2[0, -1]$, and $\Pi_3 = \beta_3[1, 1]$. For Step 6,
$\Pi_2$ and $\Pi_3$ are not feasible because $\Pi_2 \bar d_3 = 0$ regardless
of the value of $\beta_2$ and $\Pi_3 \bar d_1 < 0$ if $\beta_3 > 0$, $\Pi_3 \bar d_2 < 0$
if $\beta_3 < 0$. If $\beta_1 = -1$, $\Pi_1 D > \bar 0$, $\Pi_1 \in Z^{1 \times 2}$ and the
greatest common divisor of the components of $\Pi_1$ is unity. The
minimum dependence vector of $\Pi_1$ is $\bar d_1$ [i.e., $\Pi_1$ corresponds
to the pair $(B_1, \bar d_1)$]. So, $C_1 = \{\Pi_1 = [1, 0]\}$. For $k = 2$, there
is only one two-dimensional norm intersection $B_4$ and there
are three distinct sets of two linearly independent dependence
vectors: $\{\bar d_1, \bar d_2\}$, $\{\bar d_1, \bar d_3\}$, and $\{\bar d_2, \bar d_3\}$. Thus, there are three
pairs of a two-dimensional norm intersection and a set of
two linearly independent dependence vectors: $(B_4, \{\bar d_1, \bar d_2\})$,
$(B_4, \{\bar d_1, \bar d_3\})$, and $(B_4, \{\bar d_2, \bar d_3\})$. According to Step 5, three
distinct row vectors are generated: $\Pi_4 = \beta_4[7, -1]$, $\Pi_5 =
\beta_5[3, -1]$, and $\Pi_6 = \beta_6[1, 0]$. For Step 6, $\Pi_4$, $\Pi_5$, and $\Pi_6$ are
all feasible. However, $\Pi_5$ should not be included in $C_2$ because
$\Pi_5$ corresponds to the pair $(B_4, \{\bar d_1, \bar d_3\})$ and $\bar d_1$, $\bar d_3$ are not its
minimum dependence vectors ($\Pi_5 \bar d_2 = 2 < 6 = \Pi_5 \bar d_1 = \Pi_5 \bar d_3$). For the same reason, $\Pi_6$ should
not be included in $C_2$. If $\beta_4 = 1$, $\Pi_4 D > \bar 0$, $\Pi_4 \in Z^{1 \times 2}$
and the greatest common divisor of the components of $\Pi_4$
is unity. Vector $\Pi_4$ corresponds to the pair $(B_4, \{\bar d_1, \bar d_2\})$
and $\bar d_1$, $\bar d_2$ are minimum dependence vectors of $\Pi_4$. So, $\Pi_4$
should be included in $C_2$. Therefore, $C_2 = \{\Pi_4 = [7, -1]\}$
and $C^o = \{\Pi_1, \Pi_4\}$. To find the optimal linear schedule, it is
necessary to compare the execution times for all candidates in
$C^o$. The total execution time of each candidate is evaluated


as  follows: 

$$t(\Pi_1) = [1,0]\,([2s,0]^T - [0,s]^T) + 1 = 2s + 1$$

$$t(\Pi_4) = \left\lceil \frac{[7,-1]\,([2s,0]^T - [0,2s]^T) + 1}{10} \right\rceil = \left\lceil \frac{16s + 1}{10} \right\rceil$$


Clearly, $\Pi_4$ has the shortest execution time. Therefore, $\Pi^o =
\Pi_4 = [7, -1]$. End of example.

Example 3.3: Consider the algorithm $(J, D)$ where $J =
\{\bar j : A\bar j \le \bar b, \ \bar j \in Z^3\}$,

$$A = \begin{bmatrix} -1 & 0 & 1 \\ -1 & 1 & 0 \\ 1 & 0 & 0 \\ 1 & -1 & -1 \end{bmatrix}, \qquad \bar b = \begin{bmatrix} 0 \\ 0 \\ s \\ 0 \end{bmatrix}$$

and dependence matrix

$$D = \begin{bmatrix} 1 & 0 & 1 \\ 1 & 1 & 1 \\ 1 & 0 & 0 \end{bmatrix}.$$

The index set $J$ is shown in Fig. 4. Pictorially, $J$ is surrounded
by four planes described by the equations $x_1 - x_3 = 0$,
$x_1 - x_2 = 0$, $x_1 = s$, and $-x_1 + x_2 + x_3 = 0$. For
Step 1, there are 11 distinct norm spaces $N_i$, $i = 1, \dots, 11$,
listed in Table II. The corresponding boundary surfaces $F_i$
of $N_i$, $i = 1, \dots, 11$, are indicated in Fig. 4. There are four
two-dimensional boundary surfaces (planes) $F_1, \dots, F_4$, six
one-dimensional boundary surfaces (lines) $F_5, \dots, F_{10}$, and
one zero-dimensional boundary surface $F_{11}$. The derivations
of all norm intersections are shown in Table II. There
are 14 distinct norm intersections, i.e., $B_1 = sp\{[-1, 0, 1]\}$,
$B_2 = sp\{[-1, 1, 0]\}$, $B_3 = sp\{[1, 0, 0]\}$, $B_4 = sp\{[1, -1,
-1]\}$, $B_5 = sp\{[0, 1, 0]\}$, $B_6 = sp\{[0, 0, 1]\}$, $B_7 = sp\{[0,
1, 1]\}$, $B_8 = sp\{[-1, 0, 1], [-1, 1, 0]\}$, $B_9 = sp\{[-1, 0, 1],$


SHANG AND FORTES: TIME OPTIMAL SCHEDULES


731 


[Figure: the region $R = \{x : Ax \le \bar b, \ x \in R^3\}$.]

Fig. 4. The 4-face polyhedron inside the cube is the index set of the algorithm of Example 3.3. It has four extreme points $\bar e_1, \dots, \bar e_4$, four two-dimensional
boundary surfaces $F_1, \dots, F_4$, and six one-dimensional boundary surfaces $F_5, \dots, F_{10}$. $F_i$ is associated with the norm space $N_i$, $i = 1, \dots, 11$,
listed in Table II.


$[1, 0, 0]\}$, $B_{10} = sp\{[-1, 0, 1], [1, -1, -1]\}$, $B_{11} = sp\{[-1,
1, 0], [1, 0, 0]\}$, $B_{12} = sp\{[-1, 1, 0], [1, -1, -1]\}$, $B_{13} =
sp\{[1, 0, 0], [1, -1, -1]\}$, and $B_{14} = sp\{[-1, 0, 1], [-1, 1, 0],
[1, 0, 0]\}$. Notice that $B_5$, $B_6$, and $B_7$ are no longer norm
spaces of $J$. So, in general, it is necessary to identify
all norm intersections of $J$ instead of only the norm spaces
of $J$. As an example, if a candidate $\Pi$ belongs to $B_5$,
the intersection of norm spaces $N_7$ and $N_8$, then it is
perpendicular to both $F_7$ and $F_8$, which are associated with
$N_7$ and $N_8$, respectively. Table III lists all row vectors and
all candidates in $C^o$ generated by Procedure 3.1 and their
corresponding pairs of a $k$-dimensional norm intersection
and a set of $k$ linearly independent dependence vectors.
Each row of Table III corresponds to a row vector and its
corresponding pair of a $k$-dimensional norm intersection and
a set of $k$ linearly independent dependence vectors. Column 1
indicates the value of $k$. Column 2 lists the corresponding
norm intersections. Column 3 lists the corresponding sets of
dependence vectors. Column 4 lists the corresponding row
vectors obtained according to Step 5 in Procedure 3.1 if
rank$(BH) = k - 1$, or "NOT OK" if rank$(BH) < k - 1$ ($B$ and
$H$ are defined in Step 5 of Procedure 3.1). Column 5 tests
the feasibility of the corresponding row vector $\Pi$. The case
where $\Pi$ is not feasible is indicated by "NO." If $\Pi$ is feasible,
then it indicates the value of the constant $\beta$ (i.e., $\beta$ in Step 5 of
Procedure 3.1) such that $\Pi D > \bar 0$, $\Pi \in Z^{1 \times n}$ and the greatest
common divisor of the components of $\Pi$ is unity. Column 6
tests if the dependence vectors in column 3 are minimum
dependence vectors of $\Pi$. Column 7 gives the execution time
of the candidates. For $k = 1$, there should be 21 distinct
pairs of a one-dimensional norm intersection and a set of one
dependence vector because there are seven one-dimensional
norm intersections and three dependence vectors. However,
only seven such pairs are listed in the first seven rows of
Table III because, according to Step 5, only seven distinct row
vectors are generated by these 21 pairs. For $k = 2$, there
are 18 distinct pairs of a two-dimensional norm intersection
and a set of two linearly independent dependence vectors
because there are six two-dimensional norm intersections and
three distinct sets of two linearly independent dependence
vectors. All these pairs are listed in Table III. For $k = 3$,
there is only one such pair, corresponding to the last row of
Table III. There are only two distinct candidates in $C^o$, i.e.,
$C^o = \{[0, 1, 0], [0, 1, 1]\}$. Because the linear schedule vector
$[0, 1, 0]$ has the shorter execution time, $\Pi^o = [0, 1, 0]$ and the total
execution time is $s + 1$. End of example.
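Step 5 for the two-dimensional norm intersections of Example 3.3 can be checked mechanically. The sketch below is illustrative only: the helper `row_vector_k2` is hypothetical, and it handles just the case $k = 2$, where $H$ has a single nonzero column $\bar d_a - \bar d_b$ and the coefficients $[\alpha_1, \alpha_2]$ are fixed up to scale by $[\alpha_1, \alpha_2]BH = 0$. The dependence vectors are the columns of $D$ as given in the example.

```python
from math import gcd


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def primitive(v):
    """Divide an integer vector by the gcd of its components."""
    g = 0
    for x in v:
        g = gcd(g, abs(x))
    return tuple(x // g for x in v) if g else tuple(v)


def row_vector_k2(B, d_a, d_b, D):
    """Return (Pi, feasible) for a 2-D norm intersection, or None for 'NOT OK'."""
    h = tuple(x - y for x, y in zip(d_a, d_b))    # nonzero column of H
    bh = (dot(B[0], h), dot(B[1], h))             # B H restricted to that column
    if bh == (0, 0):
        return None                               # rank(BH) < k - 1
    alpha = (bh[1], -bh[0])                       # solves alpha . bh = 0
    pi = primitive(tuple(alpha[0] * x + alpha[1] * y
                         for x, y in zip(B[0], B[1])))
    for cand in (pi, tuple(-x for x in pi)):      # the sign plays the role of beta
        if all(dot(cand, d) > 0 for d in D):
            return cand, True
    return pi, False


d1, d2, d3 = (1, 1, 1), (0, 1, 0), (1, 1, 0)      # columns of D in Example 3.3
D = [d1, d2, d3]
B8 = [(-1, 0, 1), (-1, 1, 0)]
B10 = [(-1, 0, 1), (1, -1, -1)]
B13 = [(1, 0, 0), (1, -1, -1)]
print(row_vector_k2(B10, d1, d2, D))   # None
print(row_vector_k2(B10, d1, d3, D))
print(row_vector_k2(B13, d2, d3, D))
print(row_vector_k2(B8, d1, d2, D))
```

For these inputs the sketch reproduces the corresponding Table III entries: a "NOT OK" pair, the feasible vectors [0,1,0] and [0,1,1], and an infeasible multiple of [-1,0,1].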

Example 3.4: Consider the algorithm with index set $J =
\{\bar j : A\bar j \le \bar b, \ \bar j \in Z^2\}$ and dependence matrix

[matrix illegible in source]

The index set $J$ is shown in Fig. 5 and the dependence matrix
is the same as that of the algorithm of Example 2.1. According to
Procedure 3.1, the candidate set $C^o$ is constructed as $C^o =$




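TABLE III

Derivation of All Candidates of the Algorithm of Example 3.3. Column 1 Indicates the Value of k; Column 2 Lists the Corresponding Norm
Intersections; Column 3 Lists the Corresponding Dependence Vectors; Column 4 Lists the Corresponding Row Vectors Obtained
According to Step 5 in Procedure 3.1; Column 5 Tests the Feasibility of the Row Vector; Column 6 Tests if the Dependence
Vectors in Column 3 are Minimum; and Column 7 Lists the Execution Times of the Linear Schedule Vectors in Column 4.
(The dependence-vector entries of the first seven rows are illegible in the source.)

k   Norm intersection                       Dependence vectors    Row vector                         Feasible?        Minimum?   Execution time
1   $B_1 = sp\{[-1,0,1]\}$                  [illegible]           $\Pi_1 = \beta_1[-1,0,1]$          No
1   $B_2 = sp\{[-1,1,0]\}$                  [illegible]           $\Pi_2 = \beta_2[-1,1,0]$          No
1   $B_3 = sp\{[1,0,0]\}$                   [illegible]           $\Pi_3 = \beta_3[1,0,0]$           No
1   $B_4 = sp\{[1,-1,-1]\}$                 [illegible]           $\Pi_4 = \beta_4[1,-1,-1]$         No
1   $B_5 = sp\{[0,1,0]\}$                   [illegible]           $\Pi_5 = \beta_5[0,1,0]$           $\beta_5 = 1$    Yes        $s + 1$
1   $B_6 = sp\{[0,0,1]\}$                   [illegible]           $\Pi_6 = \beta_6[0,0,1]$           No
1   $B_7 = sp\{[0,1,1]\}$                   [illegible]           $\Pi_7 = \beta_7[0,1,1]$           $\beta_7 = 1$    Yes        $2s + 1$
2   $B_8 = sp\{[-1,0,1],[-1,1,0]\}$         $\bar d_1, \bar d_2$  $\Pi_8 = \beta_8[-1,0,1]$          No
2   $B_8$                                   $\bar d_1, \bar d_3$  $\Pi_9 = \beta_9[-1,1,0]$          No
2   $B_8$                                   $\bar d_2, \bar d_3$  $\Pi_{10} = \beta_{10}[0,1,-1]$    No
2   $B_9 = sp\{[-1,0,1],[1,0,0]\}$          $\bar d_1, \bar d_2$  $\Pi_{11} = \beta_{11}[-1,0,1]$    No
2   $B_9$                                   $\bar d_1, \bar d_3$  $\Pi_{12} = \beta_{12}[1,0,0]$     No
2   $B_9$                                   $\bar d_2, \bar d_3$  $\Pi_{13} = \beta_{13}[0,0,1]$     No
2   $B_{10} = sp\{[-1,0,1],[1,-1,-1]\}$     $\bar d_1, \bar d_2$  Not OK
2   $B_{10}$                                $\bar d_1, \bar d_3$  $\Pi_{15} = \beta_{15}[0,1,0]$     $\beta_{15} = 1$  Yes       $s + 1$
2   $B_{10}$                                $\bar d_2, \bar d_3$  $\Pi_{16} = \beta_{16}[0,1,0]$     $\beta_{16} = 1$  Yes       $s + 1$
2   $B_{11} = sp\{[-1,1,0],[1,0,0]\}$       $\bar d_1, \bar d_2$  $\Pi_{17} = \beta_{17}[0,1,0]$     $\beta_{17} = 1$  Yes       $s + 1$
2   $B_{11}$                                $\bar d_1, \bar d_3$  Not OK
2   $B_{11}$                                $\bar d_2, \bar d_3$  $\Pi_{19} = \beta_{19}[0,1,0]$     $\beta_{19} = 1$  Yes       $s + 1$
2   $B_{12} = sp\{[-1,1,0],[1,-1,-1]\}$     $\bar d_1, \bar d_2$  $\Pi_{20} = \beta_{20}[1,-1,-1]$   No
2   $B_{12}$                                $\bar d_1, \bar d_3$  $\Pi_{21} = \beta_{21}[-1,1,0]$    No
2   $B_{12}$                                $\bar d_2, \bar d_3$  $\Pi_{22} = \beta_{22}[0,0,1]$     No
2   $B_{13} = sp\{[1,0,0],[1,-1,-1]\}$      $\bar d_1, \bar d_2$  $\Pi_{23} = \beta_{23}[1,-1,-1]$   No
2   $B_{13}$                                $\bar d_1, \bar d_3$  $\Pi_{24} = \beta_{24}[1,0,0]$     No
2   $B_{13}$                                $\bar d_2, \bar d_3$  $\Pi_{25} = \beta_{25}[0,1,1]$     $\beta_{25} = 1$  Yes       $2s + 1$
3   $B_{14} = R^3$                          $\bar d_1, \bar d_2, \bar d_3$  $\Pi_{26} = \beta_{26}[0,1,0]$  $\beta_{26} = 1$  Yes  $s + 1$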

$\{[0, 1], [1, 2]\}$. The total execution time of each candidate is
evaluated as follows.


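$$t([0,1]) = \left\lceil \frac{[0,1]\,([0,s_2]^T - [0,0]^T) + 1}{2} \right\rceil = \left\lceil \frac{s_2 + 1}{2} \right\rceil$$

$$t([1,2]) = \left\lceil \frac{[1,2]\,([s_1,s_2]^T - [0,0]^T) + 1}{6} \right\rceil = \left\lceil \frac{s_1 + 2s_2 + 1}{6} \right\rceil$$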


Clearly, the optimal linear schedule depends on the variables $s_1$
and $s_2$, i.e., on the size vector $\bar b$. If $s_2 < s_1 - 2$, then $(s_2 + 1)/2 <
(s_1 + 2s_2 + 1)/6$ and $[0, 1]$ is the optimal solution. Similarly,
when $s_2 > s_1 - 2$, vector $[1, 2]$ is the optimal solution.
Therefore, for this algorithm, the candidate set $C^o$ is necessary,
i.e., every candidate is an optimal solution for some size
vector. End of example.
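The threshold $s_2 < s_1 - 2$ quoted in Example 3.4 follows from a short rearrangement of the two execution-time expressions. The derivation below is a sketch that ignores the ceiling operators, so it is exact only up to rounding:

```latex
\[
\frac{s_2 + 1}{2} < \frac{s_1 + 2 s_2 + 1}{6}
\iff 3 (s_2 + 1) < s_1 + 2 s_2 + 1
\iff s_2 < s_1 - 2 .
\]
```

Reversing the inequality gives the complementary case in which $[1, 2]$ is the better schedule.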

From  the  discussion  of  the  above  examples  it  is  possible 
to  get  additional  insights  into  the  candidates  identified  by 
Procedure 3.1. Let $B = sp\{B_1, \dots, B_k\}$ be a $k$-dimensional
norm intersection of the norm spaces associated with boundary


Fig. 5. The index set of the algorithm of Example 3.4. The optimal linear
schedule depends on the size vector $\bar b$ because it has two variables $s_1$ and $s_2$.


surfaces $F_1$ and $F_2$, respectively. Intuitively, if a candidate
$\Pi \in C^o$ is determined by $B$ and a set of $k$ linearly independent
dependence vectors $\bar d_t, \bar d_{t_1}, \dots, \bar d_{t_{k-1}}$, then $\Pi$ is perpendicular
to both boundary surfaces $F_1$ and $F_2$ and $\Pi \bar d_t = \Pi \bar d_{t_1} = \cdots =
\Pi \bar d_{t_{k-1}}$, i.e., $\Pi$ satisfies the following equation

$$\begin{cases} \Pi(\bar d_t - \bar d_t) = 0 \\ \Pi(\bar d_t - \bar d_{t_1}) = 0 \\ \quad \vdots \\ \Pi(\bar d_t - \bar d_{t_{k-1}}) = 0. \end{cases} \tag{3.4}$$

For example, in Example 3.1, linear schedule vector $\Pi_3$ is
determined by the pair $(B_3, \bar d_2)$. As can be seen in Fig. 1,
the norm intersection $B_3$ can be thought of as the intersection
of the norm spaces associated with $F_3 = surf\{A_3\}$ and
$F_j = surf\{A_1, A_4\}$. Surface $F_3$ is the line through points
$\bar e_1$ and $\bar e_4$ and $F_j$ is an extreme point shown in Fig. 1.
It can be verified that $\Pi_3$ is perpendicular to both surfaces
$F_3$ and $F_j$ and satisfies the equation $\Pi_3(\bar d_2 - \bar d_2) = 0$. Vector
$\Pi_4$ is determined by the pair $(B_4, \{\bar d_1, \bar d_2\})$ where $B_4$ is
the intersection of the norm spaces associated with any two
extreme points of $J$ shown in Fig. 1. It can be verified that $\Pi_4$
is perpendicular to any extreme point and satisfies the equations
$\Pi_4(\bar d_1 - \bar d_1) = 0$ and $\Pi_4(\bar d_1 - \bar d_2) = 0$.

The above examples also illustrate the following different
aspects of Procedure 3.1. First, Examples 3.1 and 3.2 show
that, to find the optimal linear schedule vector, not only
the dependence vectors but also the boundary surfaces,
i.e., the shape of the index set, must be
taken into account. In Example 3.1, the optimal linear schedule
vector $\Pi^o$ is determined by boundary surfaces $F_3 = surf\{A_3\}$
and $F_j = surf\{A_1, A_4\}$, i.e., $\Pi^o$ is perpendicular to both $F_3$
and $F_j$. In contrast, the optimal linear schedule vector $\Pi^o$ for
the algorithm of Example 3.2 is determined by (3.4) with two
linearly independent dependence vectors $\bar d_1$ and $\bar d_2$. Second,
for most algorithms (e.g., the algorithm of Example 3.1) each
norm intersection is still one of the norm spaces of $J$. It
is shown in Section V that this is true for the algorithms
whose index sets are shaped as a hyperparallelepiped. In
this case, there is no need to find any norm intersections.
However, in general, some norm intersections are different
from all norm spaces, as indicated by Example 3.3 (e.g., $B_5$ is
different from all norm spaces). Third, Example 3.4 indicates
that the optimal linear schedule vector $\Pi^o$ may depend on the
size vector $\bar b$ and each candidate in $C^o$ is an optimal linear
schedule vector for some instance of the algorithm. Finally,
when $k = 1$, (3.2) is trivial because $H = \bar 0$. The corresponding
row vector $\Pi$ is solely determined by (3.1). When $k = n$,
the $n$-dimensional norm intersection is simply $R^n$ and (3.1)
is trivial. In this case, the corresponding row vector can be
determined solely by (3.4).

According to the observations above, it is clear that for
two-dimensional algorithms (Examples 3.1, 3.2, and 3.4), because
all norm intersections are equal to some norm spaces,
there is no need to find norm intersections. Therefore, what
Procedure 3.1 does for two-dimensional algorithms is, for each
and every boundary line, to put the vector perpendicular to that
boundary line in the candidate set if this vector is feasible.
In addition, Procedure 3.1 considers all distinct sets of two
linearly independent dependence vectors $\{\bar d_i, \bar d_l\}$ and puts the
vector $\Pi$ such that $\Pi(\bar d_i - \bar d_l) = 0$ in the candidate set if $\Pi$
is feasible.
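The two-dimensional specialization described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: `candidates_2d` is a hypothetical helper, the boundary normals [-1,0], [0,-1], [1,1] are the norm-space generators quoted in Example 3.2, and the sketch also applies the minimum-dependence-vector test that Example 3.2 uses to reject two of its row vectors.

```python
from itertools import combinations
from math import gcd


def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


def primitive(v):
    g = gcd(abs(v[0]), abs(v[1]))
    return (v[0] // g, v[1] // g)


def feasible(pi, D):
    return all(dot(pi, d) > 0 for d in D)


def candidates_2d(boundary_normals, D):
    """Candidate schedule vectors for a two-dimensional algorithm."""
    found = set()
    # k = 1: vectors perpendicular to a boundary line, tried with both signs
    for b in boundary_normals:
        for sign in (1, -1):
            pi = primitive((sign * b[0], sign * b[1]))
            if feasible(pi, D):
                found.add(pi)
    # k = 2: vectors with Pi (d_i - d_l) = 0 for an independent pair
    for d_i, d_l in combinations(D, 2):
        if d_i[0] * d_l[1] - d_i[1] * d_l[0] == 0:
            continue                                  # linearly dependent pair
        diff = (d_i[0] - d_l[0], d_i[1] - d_l[1])
        for sign in (1, -1):
            pi = primitive((sign * diff[1], -sign * diff[0]))
            smallest = min(dot(pi, d) for d in D)
            # keep Pi only if d_i (hence d_l) is a minimum dependence vector
            if feasible(pi, D) and dot(pi, d_i) == smallest:
                found.add(pi)
    return found


normals = [(-1, 0), (0, -1), (1, 1)]
ex31 = candidates_2d(normals, [(2, 2), (0, 3)])           # assumed D of Example 2.1
ex32 = candidates_2d(normals, [(1, -3), (2, 4), (2, 0)])  # D of Example 3.2
print(sorted(ex31), sorted(ex32))
```

With these inputs the sketch returns the candidate sets stated in Examples 3.1 and 3.2.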


B.  Correctness  of  the  Procedure 

In this subsection, it is shown that $\Pi^o$, the optimal solution
of Problem 2.1, belongs to the candidate set $C^o$ constructed
by Procedure 3.1. This fact is stated as Theorem 3.1 and is
proven by using six lemmas. All the proofs of these lemmas are
provided in the Appendix. However, an example is presented
to illustrate the ideas behind these proofs.

Theorem 3.1: The optimal solution $\Pi^o$ of Problem 2.1 belongs
to $C^o$, the candidate set constructed by Procedure 3.1,
i.e., $\Pi^o \in C^o$.

The basic idea behind the proof of Theorem 3.1 is as
follows. First, it is shown that the solution space $S$ is a
convex cone (Lemma 1) that can be partitioned into convex
subcones $S_{th}$ where $\bar d_t$ is the minimum dependence vector and $\bar p_h$ is
the maximum projection vector (Lemmas 2 and 3). In other
words, for any arbitrary $\Pi$ in $S_{th}$, $\max\{\Pi(\bar j_1 - \bar j_2) : \bar j_1, \bar j_2 \in
J\} = \Pi \bar p_h$ and $\min\{\Pi \bar d_i : \bar d_i \in D\} = \Pi \bar d_t$, which implies that
the objective function described by (2.5) simply becomes a
linear fractional function of $\Pi$ as follows (Lemma 2):

$$f = \frac{\Pi \bar p_h}{\Pi \bar d_t}, \qquad \Pi \in S_{th}. \tag{3.5}$$


Each direction [14, p. 55] in the convex cone $S$ corresponds
to a feasible solution of Problem 2.1. Because for every $\Pi \in S$,
$f(\lambda \Pi) = f(\Pi)$ for any nonzero constant $\lambda$, $f$ is a function
of directions in the cone instead of the points, i.e., for all
points on the same direction except the origin, the objective
function takes the same value (Lemma 1). According to [14,
Section 11.4], Problem 2.1 is a linear fractional program in
each convex subcone $S_{th}$, and the optimal direction of $f$
over $S_{th}$ is one of the extreme directions of the subcone $S_{th}$
(Lemma 4). Therefore, one of the extreme directions of $S_{th}$,
$t = 1, \dots, m$ and $h = 1, \dots, q$, must be the optimal solution
$\Pi^o$. Finally, it is shown that all such extreme directions are
included in the candidate set $C^o$ constructed by Procedure 3.1
(Lemmas 5, 6), i.e., each candidate in $C^o$ corresponds to an
extreme direction of the convex subcone $S_{th}$ for some $t$ and $h$ and
vice versa. Therefore, the candidate in $C^o$ with the shortest
execution time is an optimal solution of Problem 2.1.
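The claim that the objective depends only on the direction of $\Pi$ can be checked numerically. The sketch below is illustrative: `objective` is a hypothetical helper that evaluates the ratio of the maximum projection difference to the minimum $\Pi \bar d_i$ exactly, the extreme points and dependence vectors are the assumed data of Example 2.1, and only positive scale factors are tried, since those keep $\Pi$ inside the cone.

```python
from fractions import Fraction


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def objective(pi, extreme_points, D):
    """f(Pi) = max Pi(j1 - j2) / min Pi d_i, evaluated as an exact fraction."""
    proj = [dot(pi, e) for e in extreme_points]
    return Fraction(max(proj) - min(proj), min(dot(pi, d) for d in D))


s = 4                                             # hypothetical size
points = [(s, 0), (2 * s, 0), (0, 2 * s), (0, s)]  # assumed extreme points of J
D = [(2, 2), (0, 3)]                              # assumed dependence vectors
values = [objective((lam * 1, lam * 2), points, D) for lam in (1, 2, 7)]
print(values)
```

All three scaled copies of [1,2] give the same objective value, which is why the search can be restricted to one representative per direction.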

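Lemma 1: The solution space $S$ is a convex cone and the objective
function $f = (\max\{\Pi(\bar j_1 - \bar j_2) : \bar j_1, \bar j_2 \in J\})/(\min
\{\Pi \bar d_i : \bar d_i \in D\})$ is a function of the directions in $S$.

Lemma 2: Let

$$S_{th} = \bigcap_{i=1}^{m} \{\Pi : \Pi \bar d_t \le \Pi \bar d_i\} \ \cap \ \bigcap_{k=1}^{q} \{\Pi : \Pi \bar p_h \ge \Pi \bar p_k\} \ \cap \ S \tag{3.6}$$

$$= \{\Pi : \Pi(\bar d_t - \bar d_i) \le 0, \ \Pi(\bar p_k - \bar p_h) \le 0, \ i = 1, \dots, m, \ k = 1, \dots, q, \ \Pi \in S\}. \tag{3.7}$$

If $S_{th}$ is not empty, then $S_{th}$ is a convex cone where the
objective function can be reexpressed as

$$f = \frac{\max\{\Pi(\bar j_1 - \bar j_2) : \bar j_1, \bar j_2 \in J\}}{\min\{\Pi \bar d_i : \bar d_i \in D\}} = \frac{\Pi \bar p_h}{\Pi \bar d_t}.$$

Lemma 3: $S = \bigcup_{t=1}^{m} \bigcup_{h=1}^{q} S_{th}$.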




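Lemma 4: An optimal direction for the objective function
$f = (\Pi \bar p_h)/(\Pi \bar d_t)$ over $S_{th}$ is one of the extreme directions
of $S_{th}$.

Lemma 5: Let $\Pi^* \in S_{th}$. Then $\Pi^*$ is an extreme direction of
$S_{th}$ if and only if there exists a set of $n - 1$ linearly independent
vectors $\bar d_t - \bar d_{t_1}, \dots, \bar d_t - \bar d_{t_{k'-1}}, \ \bar p_h - \bar p_{h_1}, \dots, \bar p_h - \bar p_{h_{n-k'}}$
such that $\Pi^*$ satisfies the following equations:

$$\text{(a)} \begin{cases} \Pi(\bar d_t - \bar d_t) = 0 \\ \Pi(\bar d_t - \bar d_{t_1}) = 0 \\ \quad \vdots \\ \Pi(\bar d_t - \bar d_{t_{k'-1}}) = 0 \end{cases} \qquad \text{(b)} \begin{cases} \Pi(\bar p_h - \bar p_{h_1}) = 0 \\ \quad \vdots \\ \Pi(\bar p_h - \bar p_{h_{n-k'}}) = 0 \end{cases} \tag{3.8}$$

where $1 \le k' \le \mathrm{rank}(D)$.

Lemma 6: Let $\Pi^* \in S_{th}$ be an extreme direction satisfying
(3.8). Then the following statements are true.

a) $\bar d_t, \bar d_{t_1}, \dots, \bar d_{t_{k'-1}}$ are linearly independent and are minimum
dependence vectors of $\Pi^*$.

b) There exists a $k$-dimensional norm intersection $B =
sp\{B_1, \dots, B_k\}$ such that $\Pi^* \in B$ and $k \le k'$, i.e., for
some constants $\alpha_i$, $i = 1, \dots, k$,

$$\Pi^* = \sum_{i=1}^{k} \alpha_i B_i. \tag{3.9}$$

c) Let

$$B = \begin{bmatrix} B_1 \\ \vdots \\ B_k \end{bmatrix} \quad \text{and} \quad H' = [\bar d_t - \bar d_t, \ \bar d_t - \bar d_{t_1}, \ \dots, \ \bar d_t - \bar d_{t_{k'-1}}].$$

Then rank$(BH') = k - 1$.

d) Without loss of generality, let $H = [\bar d_t - \bar d_t, \ \bar d_t -
\bar d_{t_1}, \dots, \bar d_t - \bar d_{t_{k-1}}]$ be such that rank$(BH) = k - 1$.
Then $\Pi^*$ can be obtained from (3.9) and the equation

$$[\alpha_1, \dots, \alpha_k]BH = \bar 0. \tag{3.10}$$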


Lemma 6 simply states that (3.8) is equivalent to (3.9) and
(3.10), i.e., $\Pi^*$ is a solution of (3.8) if and only if it is a
solution of (3.9) and (3.10). This provides another way to find
all extreme directions because, in practice, it is not easy to
identify all projection vectors, especially when the number of
extreme points is large. All extreme directions can be more
easily obtained by (3.9) and (3.10).

Example 3.5: The algorithm of Example 2.1 is used to
illustrate Lemmas 1-6. The solution space of Problem 2.1 for
this algorithm is shown in Fig. 3. Lemmas 1-4 are illustrated
as follows. The solution space $S$ is a convex cone $\{\Pi : \Pi \bar d_1 >
0, \ \Pi \bar d_2 > 0\} = \{\Pi : 2\pi_1 + 2\pi_2 > 0, \ 3\pi_2 > 0\}$. There are
24 subcones $S_{th}$, $t = 1, 2$, $h = 1, \dots, 12$. However, it can
be verified that there are only 4 nonempty subcones $S_{13}$, $S_{12}$,
$S_{22}$, and $S_{2,11}$. Therefore, the solution space $S$ is partitioned
into four convex subcones $S_{13}$, $S_{12}$, $S_{22}$, and $S_{2,11}$ as shown
in Fig. 3. The union of the four subcones is $S$. Consider
subcone $S_{13}$: $\bar d_1$ is the minimum dependence vector and $\bar p_3$
is the maximum projection vector; the objective function over
$S_{13}$ is

$$f = \frac{\Pi \bar p_3}{\Pi \bar d_1} = \frac{[\pi_1, \pi_2][-2s, 2s]^T}{[\pi_1, \pi_2][2, 2]^T} = \frac{s(-\pi_1 + \pi_2)}{\pi_1 + \pi_2}$$

$$\text{subject to} \begin{cases} \Pi \bar d_1 \le \Pi \bar d_2 \\ \Pi \bar p_3 \ge \Pi \bar p_i, & i = 1, \dots, 12 \\ \Pi D > \bar 0. \end{cases}$$

This is a linear fractional programming problem and one of
the extreme directions is the optimal solution of $f$ over $S_{13}$
[14, p. 471]. By inspection, the set of extreme directions is
$\{[0, 1], [1, 2], [1, 1]\}$, which is exactly the candidate set $C^o$
constructed by Procedure 3.1 in Example 3.1. To illustrate
Lemma 5, $S_{12} = \{\Pi : \Pi(\bar d_1 - \bar d_2) \le 0, \ \Pi(\bar p_i - \bar p_2) \le 0,
\ i = 1, \dots, 12, \ \Pi \in S\} = \{\Pi : \Pi(\bar p_3 - \bar p_2) \le 0, \ \Pi(\bar d_1 - \bar d_2) \le
0, \ \Pi \in S\}$. Clearly, $\Pi^* \in S_{12}$ is an extreme direction of $S_{12}$
if and only if it satisfies one of the following equations

$$\begin{cases} \Pi(\bar d_1 - \bar d_1) = 0 \\ \Pi(\bar p_3 - \bar p_2) = \Pi[-s, 0]^T = 0 \end{cases} \tag{3.11}$$

$$\begin{cases} \Pi(\bar d_1 - \bar d_1) = 0 \\ \Pi(\bar d_1 - \bar d_2) = \Pi[2, -1]^T = 0. \end{cases} \tag{3.12}$$

Therefore, there are two extreme directions: $[0, 1]$, corresponding
to (3.11), and $[1, 2]$, corresponding to (3.12). To illustrate
Lemma 6, consider the extreme direction $\Pi^* = [0, 1]$ determined
by (3.11). In this case, $\Pi^*(\bar d_1 - \bar d_1) = 0$ corresponds to
(3.8a) and $\Pi^*(\bar p_3 - \bar p_2) = 0$ corresponds to (3.8b) and $k' = 1$.
The following illustrates all statements of Lemma 6 one by
one. The vector $\bar d_1$ is a minimum dependence vector of $\Pi^*$;
the norm intersection $B_2 = sp\{[0, -1]\}$ found in Example 3.1
is the one that $\Pi^*$ belongs to and $k = k' = 1$; $B = [0, -1]$,
$H = [0, 0]^T$, and rank$(BH) = 0 = k - 1$; and $\Pi^*$ can
be obtained equivalently from the equations $\Pi^* = \alpha[0, -1]$ and
$\alpha[0] = 0$, corresponding to (3.9) and (3.10), respectively. End
of example.
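The count of four nonempty subcones in Example 3.5 can be checked by brute force. The following sketch is illustrative only: the extreme points are the assumed vertices of the index set of Example 2.1, the helper `subcone_of` is hypothetical, and the indices it reports follow the order in which the projection vectors are generated here, not the numbering $\bar p_1, \dots, \bar p_{12}$ of the paper. Integer vectors lying on a boundary between subcones are skipped.

```python
def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]


s = 3                                                      # hypothetical size
extreme_points = [(s, 0), (2 * s, 0), (0, 2 * s), (0, s)]  # assumed vertices
D = [(2, 2), (0, 3)]                                       # assumed d1, d2
# all projection vectors e_i - e_k with i != k
P = [(a[0] - b[0], a[1] - b[1])
     for a in extreme_points for b in extreme_points if a != b]


def subcone_of(pi):
    """Return (t, h) when d_t is the unique minimum and p_h the unique maximum."""
    dvals = [dot(pi, d) for d in D]
    pvals = [dot(pi, p) for p in P]
    if min(dvals) <= 0:
        return None                      # outside the solution space S
    if dvals.count(min(dvals)) > 1 or pvals.count(max(pvals)) > 1:
        return None                      # on a boundary between subcones
    return dvals.index(min(dvals)), pvals.index(max(pvals))


subcones = {}
for p1 in range(-8, 9):
    for p2 in range(1, 9):
        key = subcone_of((p1, p2))
        if key is not None:
            subcones.setdefault(key, []).append((p1, p2))
print(len(P), sorted(subcones))
```

Under these assumptions there are 12 projection vectors and exactly four (minimum dependence vector, maximum projection vector) combinations occur, two for each dependence vector.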

Proof of Theorem 3.1: By Lemmas 1-4, the solution space
$S$ is a convex cone; $S$ is partitioned into convex subcones
$S_{th}$, $t = 1, \dots, m$ and $h = 1, \dots, q$; and over $S_{th}$ one
of the extreme directions is optimal. Therefore, one of the
extreme directions of $S_{th}$, $t = 1, \dots, m$ and $h = 1, \dots, q$,
must be an optimal direction of the objective function $f$
described by (2.5) over $S$. By Lemmas 5 and 6, $\Pi^* \in S_{th}$
is an extreme direction of $S_{th}$ if and only if it is a solution
of (3.8), or equivalently, a solution of (3.9) and (3.10) that
has only one linearly independent solution. In other words,
$\Pi^* \in S_{th}$ is an extreme direction of $S_{th}$ if and only if it is
the only linearly independent solution of (3.9) and (3.10)
for some $k$-dimensional norm intersection of $J$, where $1 \le
k \le \mathrm{rank}(D)$, and some $k$ linearly independent dependence
vectors that are minimum dependence vectors of $\Pi^*$ and, in
(3.10), rank$(BH) = k - 1$. Now, what Procedure 3.1 does is
to consider all possible combinations of norm intersections
with all distinct sets of linearly independent dependence vectors,
obtain a row vector $\Pi_l$ by solving (3.9) and (3.10), and check
the feasibility of $\Pi_l$, the rank of the matrix $BH$, and whether the dependence
vectors are minimum. If $\Pi_l$ is feasible, rank$(BH) = k - 1$,
and the $k$ dependence vectors are minimum vectors of $\Pi_l$,
then it is included in $C^o$. So $\Pi^* \in S_{th}$ is an extreme direction
of $S_{th}$, $t = 1, \dots, m$ and $h = 1, \dots, q$, if and only if it
is included in $C^o$, which means that the optimal solution $\Pi^o$
must be in $C^o$. □

C. Complexity Analysis

The complexity of Procedure 3.1 is a constant in terms of
the cardinality of the index set $J$, i.e., it is independent of
the size of the algorithm. However, it is a function of $a$,
$m$, and $n$, i.e., of the number of rows of $A$, the number
of the dependencies, and the dimension of the index set,
respectively. Clearly, either Step 1 or Step 5 dominates the
whole procedure. Step 1 identifies all norm intersections of
$J$. For $1 \le k \le \min\{n - 1, a\}$, there are at most $\binom{a}{k}$
distinct $k$-dimensional norm spaces of $J$. For $k = \min\{n, a\}$,
there is at most one $k$-dimensional norm space, that is, $R^n$
if $a \ge n$. So, there are at most $1 + \sum_{k=1}^{\min\{a, n-1\}} \binom{a}{k}$ norm
spaces of $J$ and there are at most $\left(1 + \sum_{k=1}^{\min\{a, n-1\}} \binom{a}{k}\right)^2$
distinct norm intersections of $J$. To find each norm intersection,
at most $O(n^3)$ operations are needed to solve (3.3).
Therefore, the total number of operations that Step 1 takes is
$O\left(\left(1 + \sum_{k=1}^{\min\{a, n-1\}} \binom{a}{k}\right)^2 n^3\right)$. For Step 5, if it is assumed that there
are $m_k$ distinct $k$-dimensional norm intersections of $J$, then
there are at most $\sum_{k=1}^{\mathrm{rank}(D)} m_k \binom{m}{k}$ iterations. Each iteration
needs $O(k^3)$ operations to solve (3.10). Therefore, Step 5
takes $O\left(\sum_{k=1}^{\mathrm{rank}(D)} m_k \binom{m}{k} k^3\right)$ operations and the complexity
of Procedure 3.1 is bounded above by

$$O\left(\left(1 + \sum_{k=1}^{\min\{a, n-1\}} \binom{a}{k}\right)^2 n^3\right) + O\left(\sum_{k=1}^{\mathrm{rank}(D)} m_k \binom{m}{k} k^3\right). \tag{3.13}$$

Due to the fact that $\sum_{k=0}^{n} \binom{n}{k} = 2^n$, it follows that
$1 + \sum_{k=1}^{\min\{a, n-1\}} \binom{a}{k} \le \sum_{k=0}^{a} \binom{a}{k} = 2^a$ and
$\sum_{k=1}^{\mathrm{rank}(D)} \binom{m}{k} \le \sum_{k=0}^{m} \binom{m}{k} = 2^m$. Because there are at most
$2^a$ norm spaces of $J$, there are at most $2^a - 1$ distinct
$k$-dimensional norm intersections of $J$, i.e., $m_k < 2^a$. Therefore,
the complexity of Procedure 3.1 is bounded above by

$$O(2^{2a} n^3) + O(2^a n^3 2^m) \le O(2^a 2^{\max\{a, m\}} n^3). \tag{3.14}$$

When $a = 2n \ge m$, the complexity of Procedure 3.1 is
bounded above by

$$O(2^{4n} n^3). \tag{3.14a}$$

As can be seen from the derivation, the upper bound on the complexity of
Procedure 3.1 given by (3.14) is very loose. It is remarked that
the procedure takes time exponential in the number of rows of
the index constraint matrix $A$ and the number of dependence
vectors of the algorithm rather than in the cardinality of its
index set, i.e., it is independent of the size of the algorithm.
In most cases of practical interest the values of $a$ and $m$ are
not large and, as mentioned before, when the index set of the
algorithm is a hyperparallelepiped, Step 1 of Procedure 3.1
can be omitted. This reduces the complexity of Procedure 3.1
to $O(2^{2n} n^3)$ as explained later in Section V. It follows that
this procedure is efficient for most algorithms.
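To see how loose the closed-form bound is, the Step 1 term of (3.13) can be compared with the right-hand side of (3.14) for small parameters. This is a minimal sketch: the function names `step1_bound` and `loose_bound` and the sample values of a, m, n are illustrative assumptions, and constant factors hidden by the O-notation are ignored.

```python
from math import comb


def step1_bound(a, n):
    """(1 + sum_{k=1}^{min(a, n-1)} C(a, k))^2 * n^3, the Step 1 term of (3.13)."""
    spaces = 1 + sum(comb(a, k) for k in range(1, min(a, n - 1) + 1))
    return spaces ** 2 * n ** 3


def loose_bound(a, m, n):
    """2^a * 2^max(a, m) * n^3, the right-hand side of (3.14)."""
    return 2 ** a * 2 ** max(a, m) * n ** 3


a, m, n = 6, 3, 3      # hypothetical values with a = 2n >= m
print(step1_bound(a, n), loose_bound(a, m, n))
```

For these values the right-hand side of (3.14) coincides with the $2^{4n} n^3$ expression of (3.14a) and is several times larger than the Step 1 term it bounds.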

IV. Comparison to a Linear Programming Approach

This section discusses an alternative procedure, based on
linear programming, to find the time-optimal linear schedule
vector $\Pi^o$ and compares its complexity with that of
Procedure 3.1. The comparison indicates that Procedure 3.1 is
less computationally expensive than the linear programming
approach. However, because there are many existing techniques
to solve linear programming problems, it might
be an alternative approach for some algorithms. Variations of
the linear programming approach presented in this section are
studied in [19], [27], and [28].

As a direct consequence of Lemmas 1, 2, and 3, Problem 2.1
can be formulated as a set of linear programs as follows. In
each $S_{th}$, a subset of the solution space $S$ defined in (3.7),
Problem 2.1 is described as

$$\min \frac{\Pi \bar p_h}{\Pi \bar d_t} \quad \text{subject to} \begin{cases} \Pi \bar d_t \le \Pi \bar d_i, & i = 1, \dots, m \\ \Pi D > \bar 0 \\ \Pi(\bar p_h - \bar p_l) \ge 0, & l = 1, \dots, q. \end{cases} \tag{4.1}$$

As mentioned before, $f$ is a function of directions of the cone
$S_{th}$. Clearly, every direction $\Pi$ corresponds to a point $\Pi'$ such
that disp $\Pi' = \Pi' \bar d_t = 1$. In other words, it is good enough
to consider only those feasible solutions $\Pi' \in S_{th}$ such that
disp $\Pi' = \Pi' \bar d_t = 1$. Therefore, (4.1) is equivalent to

$$\min \frac{\Pi \bar p_h}{\Pi \bar d_t} \quad \text{subject to} \begin{cases} \Pi \bar d_t = 1 \\ \Pi D \ge \bar 1 \\ \Pi(\bar p_h - \bar p_l) \ge 0, & l = 1, \dots, q. \end{cases} \tag{4.2}$$

The constraints $\Pi D > \bar 0$ and $\Pi \bar d_t \le \Pi \bar d_i$, $i = 1, \dots, m$, in
(4.1) are implied by $\Pi D \ge \bar 1$ and $\Pi \bar d_t = 1$. Because the
denominator of the objective function is equal to one, (4.2) is
equivalent to

$$\min \Pi \bar p_h \quad \text{subject to} \begin{cases} \Pi \bar d_t = 1 \\ \Pi D \ge \bar 1 \\ \Pi(\bar p_h - \bar p_l) \ge 0, & l = 1, \dots, q. \end{cases} \tag{4.3}$$

Clearly, (4.3) is a linear programming problem. Let $\Pi_{th}$ be
the optimal solution of linear program (4.3); then the optimal
solution $\Pi^o$ of Problem 2.1 belongs to the set $\{\Pi_{th}, \ t =
1, \dots, m, \text{ and } h = 1, \dots, q\}$. In other words, $\Pi^o$ can be found
by the following procedure.

Procedure 4.1 (Finding $\Pi^o$ by the linear programming approach):
Input: Algorithm $(J, D)$ whose index set $J$ is bounded.

Output: The optimal solution $\Pi^o$ of Problem 2.1.

Step 1: Identify all extreme points $\bar e_i$, $i = 1, \dots, e$, $e \in N^+$,
of index set $J$.

Step 2: Construct the set of projection vectors $P = \{\bar p_h =
\bar e_i - \bar e_k : 1 \le i, k \le e\}$. Let $|P| = q$.

Step 3: For projection vector $\bar p_h$ and dependence vector $\bar d_t$,
$h = 1, \dots, q$ and $t = 1, \dots, m$, formulate linear
program (4.3) and find the optimal solution $\Pi_{th}$ of
(4.3) by applying existing techniques (e.g., the simplex
method).

Step 4: Compare the total execution times by $\Pi_{th}$, $t =
1, \dots, m$ and $h = 1, \dots, q$. The one with minimum
execution time is the optimal solution $\Pi^o$. Stop.
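Step 2 of Procedure 4.1 is simple to sketch. The snippet below is illustrative: `projection_vectors` is a hypothetical helper, the extreme points are the assumed vertices of the index set of Example 2.1, and pairs with $i = k$ are excluded on the assumption that the zero vector is not counted, which matches the range $h = 1, \dots, 12$ used in Example 3.5.

```python
def projection_vectors(extreme_points):
    """All distinct differences e_i - e_k of two different extreme points."""
    seen = []
    for e_i in extreme_points:
        for e_k in extreme_points:
            if e_i == e_k:
                continue                     # skip the zero vector
            p = tuple(x - y for x, y in zip(e_i, e_k))
            if p not in seen:
                seen.append(p)
    return seen


s = 5                                        # hypothetical size
P = projection_vectors([(s, 0), (2 * s, 0), (0, 2 * s), (0, s)])
q = len(P)
print(q, P)
```

Each of the q vectors is then paired with each of the m dependence vectors in Step 3, giving one linear program of the form (4.3) per pair.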

There  exist  many  techniques  to  solve  linear  programs  (4.3) 
such  as  the  simplex  method  and  the  ellipsoid  method  [29]. 
Consider a linear program $\max\{C'x : A'x \le b'\}$ and its dual
$\min\{Yb' : YA' = C', \ Y \ge 0\}$ where $C' \in Q^{1 \times l}$, $A' \in Q^{n \times l}$
and $b' \in Q^{n \times 1}$ are given and $Y \in Q^{1 \times n}$, $x \in Q^{l \times 1}$ are
unknown vectors. The complexity of the ellipsoid method for
this linear program is $O(l^5 \cdot \log_2 T)$ where $T$ is the maximum
absolute value of the entries in $A'$ or $b'$ and $l$ is the number
of columns of matrix $A'$ [29, p. 170]. The complexity of the
simplex method for the worst case of this linear program is
$O(2^l n^3)$ [29, p. 139].

The complexity of the linear programming approach is as follows. As defined in Assumption 2.1, integer $a$ is the number of rows of the index constraint matrix $A$. Clearly, there are at most $\binom{a}{n}$ extreme points of the index set $J$. Therefore, to find all these extreme points, at most $O\left(\binom{a}{n} n^3\right)$ operations are needed. There are at most $\left(\binom{a}{n} - 1\right)^2$ projection vectors (notice that $e_i - e_k$ is different from $e_k - e_i$ in general), i.e., $q \le \left(\binom{a}{n} - 1\right)^2$. So, for Step 2 to construct the set of projection vectors, at most $O\left(\left(\binom{a}{n} - 1\right)^2 \cdot n\right)$ operations are needed. In the worst case, there are $\left(\binom{a}{n} - 1\right)^2 m$ linear programs each of which is described by (4.3). Linear program (4.3) can be rewritten as

$$\min\{\Pi p_h : \Pi d_t = 1,\ \Pi A'' \ge C''\} \tag{4.4}$$

where $A'' = [D,\ p_h - p_1, \ldots, p_h - p_q]$ and $C'' = [\bar{1}, \bar{0}]$. There are $l = m + q$ columns in matrix $A''$. Linear program (4.4) can be transformed into the standard linear program $\min\{Yb' : YA' = C',\ Y \ge 0\}$ by some techniques (e.g., slack variables or surplus variables [31, p. 12]) and there are at least $l = m + q$ columns in $A'$.

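If the simplex method is applied to linear program (4.4), then the complexity of solving one linear program in Step 3 is $O(2^l n^3)$ where $l = m + q = m + \left(\binom{a}{n} - 1\right)^2$ is the number of columns of $A''$. Therefore, the number of operations needed by Step 3 is $O(m \cdot q \cdot 2^{m+q} \cdot n^3) = O\left(m \left(\binom{a}{n} - 1\right)^2 2^{m + \left(\binom{a}{n} - 1\right)^2} n^3\right)$ where $m \cdot q$ is the number of linear programs. Clearly, Step 3 dominates the whole procedure and the complexity of Procedure 4.1 by the simplex method is

$$O\left(m \left(\binom{a}{n} - 1\right)^2 2^{\,m + \left(\binom{a}{n} - 1\right)^2} n^3\right). \tag{4.5}$$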


if  - = cf  =  (r;.o  o’)’  >  e;.o  o-2” 

and, when the simplex method is used, the upper bound for the complexity of Procedure 4.1 is at least

$$O\left(2^{2n}\, 2^{\,m + 2^{2n}}\, m n^3\right). \tag{4.6}$$
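To get a feel for how quickly these worst-case counts grow, the bounds quoted above can be evaluated directly. The following is a minimal sketch, not part of the original procedure; the function name `lp_approach_bounds` and the dictionary keys are hypothetical, and the formulas are simply the bounds stated in the text (at most $\binom{a}{n}$ extreme points, $q \le (\binom{a}{n} - 1)^2$, $m \cdot q$ linear programs, and $O(2^{m+q} n^3)$ simplex operations per program).

```python
from math import comb


def lp_approach_bounds(a: int, n: int, m: int) -> dict:
    """Evaluate the worst-case counts quoted in the text.

    a: number of rows of the index constraint matrix A
    n: dimension of the index set
    m: number of dependence vectors
    """
    extreme_points = comb(a, n)          # at most C(a, n) extreme points
    q = (extreme_points - 1) ** 2        # stated bound on projection vectors
    num_programs = m * q                 # one linear program per (d_t, p_h) pair
    columns = m + q                      # l = m + q columns of A''
    # Worst-case simplex operation count, with the O(.) constant taken as 1.
    simplex_ops = num_programs * 2 ** columns * n ** 3
    return {
        "extreme_points": extreme_points,
        "q": q,
        "num_programs": num_programs,
        "columns": columns,
        "simplex_ops": simplex_ops,
    }


if __name__ == "__main__":
    # A two-dimensional index set bounded by four constraints, two dependences.
    print(lp_approach_bounds(a=4, n=2, m=2))
```

Even for this small case the bound on simplex operations is already in the tens of billions, which illustrates why the exponent $m + q$ dominates the estimate.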

If the ellipsoid method is applied to solve linear program (4.4), the complexity of the linear programming approach is $O(l^6 \log_2 T \cdot m \cdot q)$ where $l = m + q = m + \left(\binom{a}{n} - 1\right)^2$ is the number of columns of $A''$ in (4.4) and $m \cdot q$ is the number of linear programs. As mentioned before, $T$ is the maximum absolute value of the entries in $A''$ or $b'$. Because some entries of extreme points are linear functions of entries of the size vector $b$ (e.g., extreme point $e_4$ in Fig. 1), some entries of the projection vectors and matrix $A''$ are linear functions of entries of the size vector $b$ which characterizes the cardinality of the index set (size of the algorithm). Therefore, in addition to a large proportionality constant, the complexity of the ellipsoid method is a logarithmic function of the cardinality of the index set $J$, which is not desirable.

Compared to Procedure 4.1, Procedure 3.1 has at least two advantages. First, Procedure 3.1 is computationally less expensive than Procedure 4.1. This is clearly shown by (3.14) and (4.5) and by (3.14a) and (4.6). One of the reasons for this is that the linear programming approach possibly processes some extreme directions of $S_{th}$ more than once if these extreme directions belong to more than one subcone $S_{th}$. This can be illustrated by Fig. 3, which shows the solution space of the algorithm of Example 2.1. In Fig. 3, extreme direction $[0, 1]$ belongs to both $S_{12}$ and $S_{13}$. So, using the linear programming approach, that direction is considered twice, once by the linear program for $S_{12}$ and once by the linear program for $S_{13}$. In contrast, it is processed only once by Procedure 3.1.

Second, as mentioned in Section III-A, Procedure 3.1 can be implemented at compile-time because it does not require knowledge of the size vector $b$. It is necessary to apply Procedure 3.1 only once to get all solutions for all possible different instances of the algorithm. In contrast, the linear programming approach needs to know the size vector $b$ to determine extreme points and projection vectors before the simplex method can be applied. So, the linear programming approach can be applied only when the size vector $b$ is known, possibly at run-time. For each instance of the same algorithm, in general, Procedure 4.1 may have to be applied once to find the optimal solution.

V. Solution of the Time Optimal Linear Schedule Problem: The Hyperparallelepiped Index Set Case

In most applications the index set $J$ is simply a hyperparallelepiped (see Definition 5.1 below). Because of the regularity of the index set, it is reasonable to expect Procedure 3.1 to become easier for this class of algorithms. In [12], a solution of Problem 2.1 is proposed for algorithms where the index set $J$ is a hyperparallelepiped. In this section, some of the results in [12] are restated as Corollary 5.1 and proven in a different way. Their notations are used as much as possible in this section and introduced next.


SHANG AND FORTES: TIME OPTIMAL SCHEDULES


Throughout this section the shape of the index set of all algorithms under consideration is assumed to be like a hyperparallelepiped, i.e., defined as follows.

Definition 5.1 (Hyperparallelepiped index set): An index set $J$ is a hyperparallelepiped if

$$J = \{[j_1, \ldots, j_n]^T : 0 \le j_i \le s_i,\ j_i \in Z,\ s_i \in N^+,\ i = 1, \ldots, n\}. \tag{5.1}$$

Clearly, if $J$ is a hyperparallelepiped, then $J = \{j : Aj \le b,\ j \in Z^n\}$ where $A = \begin{bmatrix} I \\ -I \end{bmatrix}$, $b = \begin{bmatrix} s \\ \bar{0} \end{bmatrix}$, and $s = [s_1, \ldots, s_n]^T$. A norm vector for any boundary surface of $J$ is of the form $E_i$ or $-E_i$, $1 \le i \le n$, where, as defined in Section II, $E_i$ denotes the row vector whose entries are all zeros except that the $i$th entry is one. Any norm space has the form $sp\{E_{r_1}, \ldots, E_{r_l}\}$ where $1 \le r_1, \ldots, r_l \le n$ and $1 \le l \le n$. Therefore, any norm intersection can choose $\{E_{r_1}, \ldots, E_{r_k}\}$, $1 \le k \le n$, as its basis and is equal to some norm space.

Let $D(c_1 \cdots c_x / r_1 \cdots r_y)$ denote the submatrix of $D$ containing the elements in columns $c_1, \ldots, c_x$ and rows $r_1, \ldots, r_y$, i.e., it contains the elements of $D$ at the intersections of columns $c_1, \ldots, c_x$ and rows $r_1, \ldots, r_y$. If $D(c_1 \cdots c_k / r_1 \cdots r_k)$ is nonsingular, an integer row vector $V = [v_1, \ldots, v_k] \in Z^{1 \times k}$ is defined as $V = \beta \bar{1} D^{-1}(c_1 \cdots c_k / r_1 \cdots r_k)$ where $\beta$ is a positive integer such that $\gcd\{v_1, \ldots, v_k\} = 1$. In other words, $V$ is a vector whose entries are the sums of the corresponding columns of $D^{-1}(c_1 \cdots c_k / r_1 \cdots r_k)$ scaled so that they are integers with the greatest common divisor equal to unity. If $D(c_1 \cdots c_k / r_1 \cdots r_k)$ is nonsingular, then define

$$\Pi(c_1 \cdots c_k / r_1 \cdots r_k) = VB, \quad \text{where } B = \begin{bmatrix} E_{r_1} \\ \vdots \\ E_{r_k} \end{bmatrix}.$$

In words, the subvector $(\pi_{r_1}, \ldots, \pi_{r_k})$ of $\Pi(c_1 \cdots c_k / r_1 \cdots r_k)$ is the same as $V$ and the remaining entries of $\Pi(c_1 \cdots c_k / r_1 \cdots r_k)$ are zero. Finally, $C^J$ denotes the set $\{\Pi(c_1 \cdots c_k / r_1 \cdots r_k) : 1 \le k \le n,\ 1 \le c_1 < \cdots < c_k \le m,\ 1 \le r_1 < \cdots < r_k \le n\}$.

The  following  example  illustrates  the  notations  and  concepts 
just  introduced. 

Example 5.1: Consider the algorithm $(J, D)$ of Example 3.4 where $J = \{[j_1, j_2]^T : 0 \le j_1 \le s_1,\ 0 \le j_2 \le s_2,\ s_1, s_2, j_1, j_2 \in Z\}$ and $D = \begin{bmatrix} 2 & 0 \\ 2 & 3 \end{bmatrix}$. This algorithm has the same dependence matrix as the algorithm of Example 2.1. According to the definitions just introduced, $D(1/1) = [2]$ (contains the entry in the first column and the first row), its corresponding $V = \beta \bar{1} D^{-1}(1/1) = [1]$ where $\beta = 2$, $B = E_1 = [1, 0]$, and $\Pi(1/1) = VB = [1, 0]$; $D(1/2) = [2]$, its corresponding $V = \beta \bar{1} D^{-1}(1/2) = [1]$ where $\beta = 2$, $B = E_2 = [0, 1]$ and $\Pi(1/2) = VB = [0, 1]$; and $D(12/12) = D$, its corresponding $V = \beta \bar{1} D^{-1} = [1, 2]$ where $\beta = 6$, $B = I$, and $\Pi(12/12) = VB = V = [1, 2]$. End of example.
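As a check on the last case of Example 5.1, the computation of $V$ for $D(12/12) = D$ can be written out explicitly. This is only a worked restatement of the definitions above, under the assumption that $\bar{1} = [1, 1]$ is the row vector of ones:

$$D^{-1} = \frac{1}{6}\begin{bmatrix} 3 & 0 \\ -2 & 2 \end{bmatrix}, \qquad \bar{1} D^{-1} = \left[\tfrac{1}{6},\ \tfrac{2}{6}\right], \qquad V = 6 \cdot \bar{1} D^{-1} = [1, 2].$$

Multiplying back gives $[1, 2]\, D = [2 + 4,\ 0 + 6] = [6, 6] = \beta \bar{1}$, so both dependence vectors are displaced by the same amount $\beta = 6$ under $\Pi(12/12)$.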

Corollary 5.1: If the index set $J$ of algorithm $(J, D)$ is a hyperparallelepiped defined by (5.1), then the optimal solution $\Pi^{\circ}$ of Problem 2.1 for $(J, D)$ belongs to $C^J$, i.e.,

$$\Pi^{\circ} \in C^J = \{\Pi(c_1 \cdots c_k / r_1 \cdots r_k) : 1 \le k \le n,\ 1 \le c_1 < \cdots < c_k \le m,\ 1 \le r_1 < \cdots < r_k \le n\}. \tag{5.2}$$

Proof:  Provided  in  Appendix. 

Procedure 5.1 (Construction of $C^J$):

Input: Algorithm $(J, D)$ where $J$ is a hyperparallelepiped.
Output: A finite candidate set $C^J$ containing the optimal solution of Problem 2.1.

Step 1: $k = 1$, $l = 1$.

Step 2: $C_k = \emptyset$.

Step 3: Pick an unprocessed combination of $k$ elements from $\{1, \ldots, n\}$. Denote it $\{r_1, \ldots, r_k\}$.

Step 4: Pick an unprocessed combination of $k$ elements from $\{1, \ldots, m\}$. Denote it $\{c_1, \ldots, c_k\}$.

Step 5: If $D(c_1 \cdots c_k / r_1 \cdots r_k)$ is not singular, then $\Pi_l = \Pi(c_1 \cdots c_k / r_1 \cdots r_k)$ and $C_k = C_k \cup \{\Pi_l\}$, $l = l + 1$.

Step 6: Check if all distinct combinations of $k$ elements from $\{1, \ldots, m\}$ have been processed. If not, go to Step 4.

Step 7: Check if all distinct combinations of $k$ elements from $\{1, \ldots, n\}$ have been processed. If not, go to Step 3.

Step 8: If $k < \operatorname{rank}(D)$, then $k = k + 1$, go to Step 2.

Step 9: $C^J = \bigcup_{k=1}^{\operatorname{rank}(D)} C_k$. Stop.
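The steps of Procedure 5.1 can be sketched in a few lines of code. The version below is an illustrative sketch rather than the authors' implementation: the helper names (`solve_ones`, `primitive`, `candidate_set`) are hypothetical, exact rational arithmetic is used instead of an explicit matrix inverse, and $k$ is looped up to $\min(n, m)$ rather than $\operatorname{rank}(D)$, which yields the same set because every $k \times k$ submatrix with $k > \operatorname{rank}(D)$ is singular. It assumes Python 3.9 or later for `math.lcm`.

```python
from fractions import Fraction
from itertools import combinations
from math import gcd, lcm


def solve_ones(sub):
    """Return the row vector x with x * sub = [1, ..., 1], or None if singular."""
    k = len(sub)
    # x * sub = 1  is equivalent to  sub^T * x^T = 1^T; build [sub^T | 1].
    aug = [[Fraction(sub[r][c]) for r in range(k)] + [Fraction(1)]
           for c in range(k)]
    for col in range(k):  # Gauss-Jordan elimination with exact fractions
        pivot = next((i for i in range(col, k) if aug[i][col] != 0), None)
        if pivot is None:
            return None  # singular submatrix
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for i in range(k):
            if i != col and aug[i][col] != 0:
                f = aug[i][col]
                aug[i] = [vi - f * vc for vi, vc in zip(aug[i], aug[col])]
    return [aug[i][k] for i in range(k)]


def primitive(x):
    """Scale a rational vector by a positive factor to coprime integers."""
    common = lcm(*(v.denominator for v in x))
    ints = [int(v * common) for v in x]
    g = gcd(*ints)
    return [i // g for i in ints]


def candidate_set(D):
    """Enumerate the candidates Pi(c_1..c_k / r_1..r_k); D[r][c] is row r, column c."""
    n, m = len(D), len(D[0])
    candidates = []
    for k in range(1, min(n, m) + 1):
        for rows in combinations(range(n), k):        # Step 3
            for cols in combinations(range(m), k):    # Step 4
                sub = [[D[r][c] for c in cols] for r in rows]
                x = solve_ones(sub)                   # 1 * D^{-1}(cols / rows)
                if x is None:
                    continue                          # Step 5: skip singular
                pi = [0] * n
                for r, v in zip(rows, primitive(x)):
                    pi[r] = v                         # V placed at rows r_1..r_k
                if tuple(pi) not in candidates:
                    candidates.append(tuple(pi))
    return candidates


if __name__ == "__main__":
    print(candidate_set([[2, 0], [2, 3]]))
```

For the dependence matrix of Example 5.1 this enumerates the vectors $[1, 0]$, $[0, 1]$ and $[1, 2]$; feasibility of each candidate still has to be checked separately.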

Due to the fact that all norm intersections of $J$ are known, Procedure 5.1 does not compute distinct norm intersections. This makes its complexity lower than that of Procedure 3.1. Clearly, Step 5 dominates the whole procedure. For Step 5, there are $\binom{n}{k}$ distinct combinations of $k$ elements from $\{1, \ldots, n\}$ and there are $\binom{m}{k}$ distinct combinations of $k$ elements from $\{1, \ldots, m\}$. So there are $\sum_{k=1}^{\operatorname{rank}(D)} \binom{n}{k}\binom{m}{k} \le \binom{n+m}{n} - 1$ iterations for Step 5. Each iteration needs $O(k^3)$ operations to compute the inverse $D^{-1}(c_1 \cdots c_k / r_1 \cdots r_k)$ if it is nonsingular. Therefore, the complexity for Procedure 5.1 is bounded above by $O\left(\left(\binom{n+m}{n} - 1\right) n^3\right)$. If $\operatorname{rank}(D) = n = m$, then $\sum_{k=1}^{n} \binom{n}{k}\binom{n}{k} = \sum_{k=1}^{n} \binom{n}{k}^2 \le \left(\sum_{k=0}^{n} \binom{n}{k}\right)^2 = (2^n)^2 = 2^{2n}$ and the complexity of Procedure 5.1 is bounded above by $O(2^{2n} n^3)$.
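The iteration count for Step 5 is easy to tabulate. The short sketch below (the function name `step5_iterations` is hypothetical) sums $\binom{n}{k}\binom{m}{k}$ over $k$ and compares the result with the $2^{2n}$ bound for the square, full-rank case; it relies on the standard Vandermonde identity $\sum_{k=0}^{n}\binom{n}{k}^2 = \binom{2n}{n}$, which is a known fact rather than something stated in the text.

```python
from math import comb


def step5_iterations(n: int, m: int, rank: int) -> int:
    """Number of (row set, column set) pairs examined by Step 5."""
    return sum(comb(n, k) * comb(m, k) for k in range(1, rank + 1))


if __name__ == "__main__":
    for n in range(1, 7):
        count = step5_iterations(n, n, n)
        # Square full-rank case: count = C(2n, n) - 1, bounded by 2^(2n).
        print(n, count, comb(2 * n, n) - 1, 2 ** (2 * n))
```

For $n = m = 2$ the count is 5, matching the four $1 \times 1$ submatrices and the single $2 \times 2$ submatrix of a two-dimensional example with two dependence vectors.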

Some comments are made here regarding how the candidate set $C^J$ constructed by Procedure 5.1 is related to $C^{\circ}$ constructed by Procedure 3.1. First, each candidate $\Pi(t_1 \cdots t_{k-1} t / r_1 \cdots r_k)$ in $C^J$ corresponds to the combination of the norm intersection $sp\{E_{r_1}, \ldots, E_{r_k}\}$ and the set of dependence vectors $d_t, d_{t_1}, \ldots, d_{t_{k-1}}$. This is shown in the proof of Corollary 5.1 provided in the Appendix. So $\Pi(t_1 \cdots t_{k-1} t / r_1 \cdots r_k)$ belongs to the solution space of (3.4). This can be verified as follows. Let $\Pi = \Pi(t_1 \cdots t_{k-1} t / r_1 \cdots r_k)$, then $\Pi = VB$ where $V = \beta \bar{1} D^{-1}(t_1 \cdots t_{k-1} t / r_1 \cdots r_k)$ and $B = \begin{bmatrix} E_{r_1} \\ \vdots \\ E_{r_k} \end{bmatrix}$. Clearly, $B[d_{t_1}, \ldots, d_{t_{k-1}}, d_t] = D(t_1 \cdots t_{k-1} t / r_1 \cdots r_k)$. So $\Pi[d_{t_1}, \ldots, d_{t_{k-1}}, d_t] = VB[d_{t_1}, \ldots, d_{t_{k-1}}, d_t] = \beta \bar{1}$, i.e., $\Pi d_{t_1} = \cdots = \Pi d_{t_{k-1}} = \Pi d_t = \beta$. Therefore, $\Pi$ is a solution of (3.4).

Second, the candidate set $C^{\circ}$ constructed by Procedure 3.1 is a subset of $C^J$, i.e., $C^{\circ} \subseteq C^J$ for algorithms whose index sets are hyperparallelepipeds. Let $\Pi \in C^{\circ}$ be determined by a $k$-dimensional norm intersection $B$ and $k$ linearly independent dependence vectors $d_t, d_{t_1}, \ldots, d_{t_{k-1}}$. As mentioned before, a basis for $B$ is $\{E_{r_1}, \ldots, E_{r_k}\}$. Then $\Pi = \Pi(t_1 \cdots t_{k-1} t / r_1 \cdots r_k) \in C^J$. As a matter of fact, $C^{\circ} = \{\Pi : \Pi \in C^J,\ \Pi \text{ or } -\Pi \text{ is feasible and } d_t, d_{t_1}, \ldots, d_{t_{k-1}} \text{ are minimum dependence vectors of } \Pi\}$. This is true because each $\Pi(t_1 \cdots t_{k-1} t / r_1 \cdots r_k)$ corresponds to a combination of a $k$-dimensional norm space $sp\{E_{r_1}, \ldots, E_{r_k}\}$ with $k$ linearly independent dependence vectors $d_t, d_{t_1}, \ldots, d_{t_{k-1}}$ and vice versa, and $C^J$ includes all the row vectors corresponding to all combinations as long as $D(t_1 \cdots t_{k-1} t / r_1 \cdots r_k)$ is nonsingular, i.e., the condition $\operatorname{rank}(BH') = k - 1$ in Section III-A is met.

Third, for $\Pi(t_1 \cdots t_{k-1} t / r_1 \cdots r_k)$, the order in which $t_1, \ldots, t_{k-1}$ and $t$ appear and the order in which $r_1, \ldots, r_k$ appear do not matter. As an example, $\Pi(t_1 t_2 \cdots t_{k-1} t / r_1 r_2 \cdots r_k) = \Pi(t_2 t_1 \cdots t_{k-1} t / r_1 r_2 \cdots r_k) = \Pi(t_2 t_1 \cdots t_{k-1} t / r_2 r_1 \cdots r_k) = \Pi(t_1 t_2 \cdots t_{k-1} t / r_2 r_1 \cdots r_k)$. This is due to the fact that they are all determined by the same combination of a $k$-dimensional norm intersection $sp\{E_{r_1}, \ldots, E_{r_k}\}$ with a set of $k$ linearly independent dependence vectors $d_t, d_{t_1}, \ldots, d_{t_{k-1}}$. Next, an example is used to illustrate Procedure 5.1.

Example 5.2: Consider the algorithm of Example 5.1. Let $k = 1$; there are two possible combinations of one element from $\{1, \ldots, n\}$ and two possible combinations of one element from $\{1, \ldots, m\}$ ($n = m = 2$). So, $D(1/1)$, $D(1/2)$, $D(2/1)$, and $D(2/2)$ are processed. Each corresponding vector is obtained as follows. $D(1/1) = [2]$ and $\Pi(1/1) = [1, 0]$; $D(2/1) = [0]$ is ignored because it is singular; $D(1/2) = [2]$ and $\Pi(1/2) = [0, 1]$; $D(2/2) = [3]$ and $\Pi(2/2) = [0, 1]$. For $k = 2$, there is only one combination, i.e., $D(12/12) = D$ and $\Pi(12/12) = [1, 2]$. So $C^J = \{[1, 0], [0, 1], [1, 2]\}$. $\Pi(1/1)$ is not feasible. If it is removed from $C^J$, then $C^{\circ}$ is obtained as $\{[0, 1], [1, 2]\}$, which is the same as the one constructed by Procedure 3.1 in Example 3.4. The total execution time by each candidate is as follows.


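$$t([0, 1]) = \left\lceil \frac{[0, 1]\left([s_1, s_2]^T - [0, 0]^T\right) + 1}{2} \right\rceil = \left\lceil \frac{s_2 + 1}{2} \right\rceil$$

$$t([1, 2]) = \left\lceil \frac{[1, 2]\left([s_1, s_2]^T - [0, 0]^T\right) + 1}{6} \right\rceil = \left\lceil \frac{s_1 + 2 s_2 + 1}{6} \right\rceil$$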


If $s_1 \gg s_2$, then the optimal solution of Problem 2.1 is $\Pi^{\circ} = \Pi(1/2) = [0, 1]$; $\Pi^{\circ}$ is perpendicular to the surface $surf\{[0, 1]\}$. Compared to Example 3.1, these two algorithms have different optimal linear schedule vectors although they have the same dependence matrix. This is due to the fact that they have different index sets. End of example.
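The comparison in Example 5.2 can be reproduced by brute force for a concrete size vector. The sketch below is an illustration only: the function name `execution_time` is hypothetical, the execution time is assumed to take the form used in the example (the largest difference of $\Pi j$ over the index set, plus one, divided by the smallest displacement $\Pi d_i$ and rounded up), and a schedule vector is treated as feasible only if every displacement $\Pi d_i$ is positive.

```python
from itertools import product


def execution_time(pi, deps, sizes):
    """Total execution time of schedule pi on the box 0 <= j_i <= s_i.

    Returns None when pi is infeasible (some dependence is not advanced).
    """
    displacements = [sum(p * d for p, d in zip(pi, dep)) for dep in deps]
    if min(displacements) <= 0:
        return None
    # Enumerate every index point of the hyperparallelepiped (small cases only).
    values = [sum(p * j for p, j in zip(pi, point))
              for point in product(*(range(s + 1) for s in sizes))]
    span = max(values) - min(values)
    return -(-(span + 1) // min(displacements))  # integer ceiling division


if __name__ == "__main__":
    deps = [(2, 2), (0, 3)]      # columns of D in Example 5.1
    sizes = (8, 2)               # an assumed instance with s1 much larger than s2
    for pi in [(1, 0), (0, 1), (1, 2)]:
        print(pi, execution_time(pi, deps, sizes))
```

With $s_1 = 8$ and $s_2 = 2$, the candidate $[1, 0]$ is rejected as infeasible, $[0, 1]$ needs 2 steps and $[1, 2]$ needs 3, which is consistent with $[0, 1]$ being optimal when $s_1$ is much larger than $s_2$.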


VI. Conclusions and Future Work

In summary, this paper provides a solution to the problem of identifying optimal linear schedules for the execution of algorithms with uniform dependences. The complexity of the proposed procedure is independent of the size of the algorithm and is exponential in the dimension of its index set. For practical algorithms the index set has rather small dimension and the procedure is therefore very efficient. In comparison to extant work, this contribution considers a larger class of index sets and linear schedules. It is also shown that, when index sets are hyperparallelepipeds, the procedure is considerably simpler and faster. The proposed optimization procedure can also be used to identify optimal schedules for the execution of a specific computation [16].

In  MIMD  machines,  the  total  execution  time  of  an  algorithm 
consists not only of the time required to execute all of its computations
but also of the time spent in coordinating concurrent
operations,  e.g.,  in  data  communication  and  synchronization. 
The  partitioning  techniques  studied  in  [15],  [25],  and  [26]  are 
potentially  applicable  to  the  problems  of  selecting  the  linear 
schedules  which  minimize  the  time  spent  in  the  coordination  of 
concurrent  operations.  Together  with  the  scheduling  approach 
discussed  in  this  paper,  those  techniques  may  be  of  use  to 
identify  optimal  schedules  for  uniform  dependence  algorithms 
executed  in  MIMD  machines.  This  topic  will  be  addressed  in 
future  research. 


Appendix 

Proof of Lemma 1: By definition,

$$S = \bigcap_{i=1}^{m} \{\Pi : \Pi d_i > 0\}. \tag{a.1}$$

Clearly, $\{\Pi : \Pi d_i > 0\}$ is a convex cone. By Theorem 2.5 in [17], which states that the intersection of an arbitrary collection of convex cones is a convex cone, $S$ is a convex cone. For the objective function $f$ in (2.5), $f(\Pi) = f(\lambda \Pi)$ for any nonzero constant $\lambda$. Therefore, $f$ is a function of directions in $S$. □
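The scale invariance used in this proof can be checked numerically. The sketch below is only an illustration under stated assumptions: the objective is taken to be the ratio of the largest difference $\Pi(j_1 - j_2)$ over the index set to the smallest displacement $\Pi d_i$ (the form that appears in the proof of Lemma 2), the index set is assumed to be a small box, only a positive scale factor is tried, and the function name `objective` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def objective(pi, deps, sizes):
    """Ratio max Pi(j1 - j2) / min Pi d_i on the box 0 <= j_i <= s_i."""
    values = [sum(p * j for p, j in zip(pi, point))
              for point in product(*(range(s + 1) for s in sizes))]
    smallest = min(sum(p * d for p, d in zip(pi, dep)) for dep in deps)
    return Fraction(max(values) - min(values), smallest)


if __name__ == "__main__":
    deps = [(2, 2), (0, 3)]
    sizes = (3, 2)
    print(objective((1, 2), deps, sizes))   # value for Pi = [1, 2]
    print(objective((3, 6), deps, sizes))   # same direction scaled by 3
```

Both calls return the same fraction, because scaling $\Pi$ multiplies the numerator and the denominator by the same factor; only the direction of $\Pi$ matters.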

Proof of Lemma 2: Obviously, $S_{th}$ is a convex cone because it is an intersection of convex cones. By definition of $S_{th}$, for all directions $\Pi \in S_{th}$, $\Pi d_t \le \Pi d_i$, $i = 1, \ldots, m$ and $\Pi p_h \ge \Pi p_k$, $k = 1, \ldots, q$. Therefore, $d_t$ is the minimum dependency of $\Pi \in S_{th}$, and $p_h$ is the maximum projection vector of $\Pi \in S_{th}$, i.e., $\Pi d_t = \min\{\Pi d_i : d_i \in D\}$ and $\Pi p_h = \max\{\Pi p_i : p_i \in P\} = \max\{\Pi(j_1 - j_2) : j_1, j_2 \in J\}$. Thus,

$$\frac{\max\{\Pi(j_1 - j_2) : j_1, j_2 \in J\}}{\min\{\Pi d_i : d_i \in D\}} = \frac{\Pi p_h}{\Pi d_t}.$$

□


Proof of Lemma 3: For every $\Pi \in S$, there must exist $d_t$ and $p_h$ such that $d_t$ is the minimum dependency and $p_h$ is the maximum projection vector, which implies that there exists a subcone $S_{th}$ such that $\Pi \in S_{th}$. So $S \subseteq \bigcup_{t=1}^{m} \bigcup_{h=1}^{q} S_{th}$. By definition of $S_{th}$ [see (3.6)], for every $\Pi \in S_{th}$, $\Pi \in S$. Therefore, $S_{th} \subseteq S$, which means that $S \supseteq \bigcup_{t=1}^{m} \bigcup_{h=1}^{q} S_{th}$. Therefore, $S = \bigcup_{t=1}^{m} \bigcup_{h=1}^{q} S_{th}$. □
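The covering argument of Lemma 3 amounts to saying that every feasible $\Pi$ has some minimum dependence vector and some maximum projection vector. The sketch below finds them by direct evaluation; it is an illustration only, the function name `subcone_indices` is hypothetical, and the extreme points used in the demonstration are the corners of an assumed $3 \times 2$ box rather than the index set of any example in the paper.

```python
def subcone_indices(pi, deps, extreme_points):
    """Return (t_set, h_set): minimizing dependences and maximizing projections.

    t_set holds 0-based indices into deps; h_set holds the projection vectors
    e_i - e_k (as tuples) on which pi attains its maximum.
    """
    def dot(v):
        return sum(p * x for p, x in zip(pi, v))

    # All distinct differences of two different extreme points, in sorted order.
    projections = sorted({tuple(a - b for a, b in zip(ei, ek))
                          for ei in extreme_points
                          for ek in extreme_points if ei != ek})
    d_values = [dot(d) for d in deps]
    p_values = [dot(p) for p in projections]
    t_set = [i for i, v in enumerate(d_values) if v == min(d_values)]
    h_set = [p for p, v in zip(projections, p_values) if v == max(p_values)]
    return t_set, h_set


if __name__ == "__main__":
    deps = [(2, 2), (0, 3)]
    corners = [(0, 0), (3, 0), (0, 2), (3, 2)]
    print(subcone_indices((0, 1), deps, corners))
```

For $\Pi = [0, 1]$ the maximum is attained on three different projection vectors, so this direction lies in several subcones at once. This is the situation in which the linear programming approach may examine the same extreme direction more than once.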

Proof of Lemma 4: In $S_{th}$, Problem 2.1 can be formulated as

$$\min\ \frac{\Pi p_h}{\Pi d_t} \quad \text{subject to} \quad \Pi \in S_{th}$$

where $S_{th}$ is defined by (3.7). According to [11, Section 11.4], this is a linear fractional programming problem and one of the extreme directions is the optimal direction of $f$ over $S_{th}$. □




Proof of Lemma 5: The idea behind this proof is as follows. First, if $\Pi^*$ is an extreme direction and there does not exist a set of $n - 1$ linearly independent vectors $d_t - d_{t_1}, \ldots, d_t - d_{t_{k'-1}}$, $p_{h_1} - p_h, \ldots, p_{h_{n-k'}} - p_h$ such that $\Pi^*$ satisfies (3.8), then $\Pi^*$ can be written as a positive linear combination of two distinct directions $\Pi_1, \Pi_2 \in S_{th}$ ($\Pi_1$ and $\Pi_2$ are distinct if $\Pi_1 \ne \alpha \Pi_2$ for any nonzero constant $\alpha$). By definition of an extreme direction [14, p. 55], $\Pi^*$ cannot be an extreme direction, which is a contradiction. Second, it is shown that if there exists a set of $n - 1$ linearly independent vectors $d_t - d_{t_1}, \ldots, d_t - d_{t_{k'-1}}$, $p_{h_1} - p_h, \ldots, p_{h_{n-k'}} - p_h$ such that $\Pi^*$ satisfies (3.8), then $\Pi^*$ cannot be written as any positive linear combination of two distinct directions of $S_{th}$, i.e., $\Pi^*$ is an extreme direction. The formal proof is as follows.

1) ($\Rightarrow$) A direction $\Pi$ of a convex set $\Omega$ is called an extreme direction if it cannot be written as a positive linear combination of two distinct directions in $\Omega$, that is, if $\Pi = \lambda_1 \Pi_1 + \lambda_2 \Pi_2$ where $\lambda_1, \lambda_2 > 0$ and $\Pi_1, \Pi_2 \in \Omega$, then $\Pi_1 = \alpha \Pi_2$ for some constant $\alpha > 0$ [14, p. 55]. By definition, $S_{th} = \{\Pi : \Pi[d_t - d_1, \ldots, d_t - d_m,\ p_1 - p_h, \ldots, p_q - p_h] \le \bar{0},\ -\Pi[d_1, \ldots, d_m] < \bar{0}\}$. Without loss of generality, let $c_i = d_t - d_i$ for $i = 1, \ldots, m$, and $c_{l+m} = p_l - p_h$ for $l = 1, \ldots, q$, i.e., $S_{th} = \{\Pi : \Pi[c_1, \ldots, c_r] \le \bar{0},\ -\Pi[d_1, \ldots, d_m] < \bar{0}\}$, where $r = m + q$. Suppose that $\Pi^*$ is an extreme direction in $S_{th}$ and there does not exist a set of $n - 1$ linearly independent vectors $c_{i_1}, \ldots, c_{i_{n-1}}$ such that $\Pi^*$ satisfies (3.8). Because $\Pi^* \in S_{th}$, it satisfies the following equations:

$$\Pi[c_1, \ldots, c_r] \le \bar{0}$$

$$-\Pi[d_1, \ldots, d_m] < \bar{0}.$$

Assume that there are $s$ vectors $c_{i_1}, \ldots, c_{i_s}$, $s \in Z$, such that $\Pi^* c_{i_j} = 0$, $j = 1, \ldots, s$, and $\operatorname{rank}([c_{i_1}, \ldots, c_{i_s}]) = s' < n - 1$. Without loss of generality, let us assume $i_j = j$, $j = 1, \ldots, s$, i.e., $\Pi^*$ is a solution of (a.2) and satisfies (a.3) below:

$$\begin{cases} \Pi c_1 = 0 \\ \quad\vdots \\ \Pi c_s = 0 \end{cases} \tag{a.2}$$

and

$$\begin{cases} \Pi c_{s+1} < 0 \\ \quad\vdots \\ \Pi c_r < 0 \\ -\Pi d_1 < 0 \\ \quad\vdots \\ -\Pi d_m < 0. \end{cases} \tag{a.3}$$

There are at least two independent solutions of (a.2) because $s' < n - 1$. Next, it is shown that there exist at least two linearly independent solutions $\Pi^*$ and $\Pi_2$ of (a.2) that are in $S_{th}$. Let $\Pi'$ be a solution of (a.2) and $\Pi' \ne \gamma \Pi^*$ for any constant $\gamma$, i.e., $\Pi'$ and $\Pi^*$ are linearly independent. Let $\Pi_2 = \alpha' \Pi' + \alpha \Pi^*$ where $\alpha' > 0$ and $\alpha > 0$ are constants. Obviously, $\Pi_2$ satisfies (a.2). Now,

$$\Pi_2[c_{s+1}, \ldots, c_r, -d_1, \ldots, -d_m] = \alpha' \Pi'[c_{s+1}, \ldots, c_r, -d_1, \ldots, -d_m] + \alpha \Pi^*[c_{s+1}, \ldots, c_r, -d_1, \ldots, -d_m].$$

It is always possible to choose a large enough constant $\alpha$ and a small enough constant $\alpha'$ such that $\Pi_2[c_{s+1}, \ldots, c_r, -d_1, \ldots, -d_m] < \bar{0}$ because $\Pi^*[c_{s+1}, \ldots, c_r, -d_1, \ldots, -d_m] < \bar{0}$. Thus, $\Pi_2 \in S_{th}$, i.e., there exist at least two linearly independent solutions of (a.2), $\Pi_2$ and $\Pi^*$, that are in $S_{th}$. Let $\Pi_1$ be constructed as follows:

$$\Pi_1 = \lambda \Pi^* - \lambda_2 \Pi_2 \quad \text{where } \lambda > 0 \text{ and } \lambda_2 > 0 \text{ are constants.} \tag{a.4}$$

The fact that $\Pi^* \ne \gamma \Pi_2$ for any constant $\gamma$ implies $\Pi_1 \ne \gamma_2 \Pi_2$ for any constant $\gamma_2$. Thus, $\Pi_1$ and $\Pi_2$ are distinct. If it can be shown that $\Pi_1 \in S_{th}$, then by the definition of extreme directions, $\Pi^*$ is not an extreme direction, which is contrary to the assumption. Now, $\Pi_1$ is a solution of (a.2) because it is a linear combination of two solutions of (a.2). For (a.3),

$$\Pi_1[c_{s+1}, \ldots, c_r, -d_1, \ldots, -d_m] = \lambda \Pi^*[c_{s+1}, \ldots, c_r, -d_1, \ldots, -d_m] - \lambda_2 \Pi_2[c_{s+1}, \ldots, c_r, -d_1, \ldots, -d_m].$$

Again, it is always possible to choose a large enough constant $\lambda$ and a small enough constant $\lambda_2$ such that $\Pi_1[c_{s+1}, \ldots, c_r, -d_1, \ldots, -d_m] < \bar{0}$ because $\Pi^*[c_{s+1}, \ldots, c_r, -d_1, \ldots, -d_m] < \bar{0}$. Thus, $\Pi_1 \in S_{th}$ and $\Pi^*$ can be expressed as a positive linear combination of two other distinct directions in $S_{th}$, i.e., $\Pi^* = \alpha_1 \Pi_1 + \alpha_2 \Pi_2$ where $\alpha_1 = (1/\lambda) > 0$ and $\alpha_2 = (\lambda_2/\lambda) > 0$, which means $\Pi^*$ is not an extreme direction. By assumption, $\Pi^*$ is an extreme direction. Therefore, there must exist a set of $n - 1$ linearly independent vectors $d_t - d_{t_1}, \ldots, d_t - d_{t_{k'-1}}$, $p_{h_1} - p_h, \ldots, p_{h_{n-k'}} - p_h$ such that $\Pi^*$ satisfies (3.8).

2) ($\Leftarrow$) Conversely, suppose that there exists a set of $n - 1$ linearly independent vectors $c_1, \ldots, c_{n-1}$ such that $\Pi^*$ is a solution of (3.8) and it is not an extreme direction of $S_{th}$. Then $\Pi^*$ can be expressed as a positive linear combination of two distinct directions of $S_{th}$, i.e., there exist $\Pi_1, \Pi_2 \in S_{th}$ such that $\Pi^* = \lambda_1 \Pi_1 + \lambda_2 \Pi_2$ where $\lambda_1, \lambda_2 > 0$ and $\Pi_1 \ne \alpha \Pi_2$ for any constant $\alpha$. By bringing $\Pi^*$ into (3.8), then

$$\lambda_1 \Pi_1 c_i + \lambda_2 \Pi_2 c_i = 0, \quad i = 1, \ldots, n - 1. \tag{a.5}$$

Because $\Pi_1, \Pi_2 \in S_{th}$, $\lambda_1 \Pi_1 c_i \le 0$ and $\lambda_2 \Pi_2 c_i \le 0$. Therefore, the only choices for $\Pi_1$ and $\Pi_2$ that satisfy (a.5) must be such that

$$\lambda_1 \Pi_1 c_i = \lambda_2 \Pi_2 c_i = 0, \quad i = 1, \ldots, n - 1.$$

So, $\Pi_1$ and $\Pi_2$ are solutions of (3.8). Because $\operatorname{rank}([c_1, \ldots, c_{n-1}]) = n - 1$, there is only one linearly independent solution of (3.8). So $\Pi^*$, $\Pi_1$, and $\Pi_2$ are linearly dependent, i.e., $\Pi^* = \alpha_1 \Pi_1 = \alpha_2 \Pi_2$ for some nonzero constants $\alpha_1$ and $\alpha_2$. This means that $\Pi_1$ and $\Pi_2$ are not distinct, in contradiction with the initial assumption. Therefore, $\Pi^*$ is an extreme direction. □

Proof of Lemma 6:

Proof of Statement a): Suppose $d_t, d_{t_1}, \ldots, d_{t_{k'-1}}$ are linearly dependent. Then $\operatorname{rank}([d_t, d_{t_1}, \ldots, d_{t_{k'-1}}]) = k' - 1$ because $d_t - d_{t_1}, \ldots, d_t - d_{t_{k'-1}}$ are linearly independent. Consider the following equation:

$$\begin{cases} \Pi d_t = 0 \\ \Pi d_{t_1} = 0 \\ \quad\vdots \\ \Pi d_{t_{k'-1}} = 0. \end{cases} \tag{a.6}$$

Let $\Omega_1$ and $\Omega_2$ be the solution spaces of (a.6) and (3.8a), respectively. If $\Pi$ belongs to $\Omega_1$, then $\Pi$ belongs to $\Omega_2$ because $\Pi d_t - \Pi d_{t_i} = 0$, $i = 1, \ldots, k' - 1$, i.e., $\Pi$ satisfies (3.8a). Thus, $\Omega_1 \subseteq \Omega_2$. Furthermore, $\Omega_1 = \Omega_2$ because $\dim\{\Omega_1\} = \dim\{\Omega_2\} = n - k' + 1$. Now because $\Pi^*$ satisfies (3.8a), it belongs to $\Omega_2$, which means it belongs to $\Omega_1$ also. So $\Pi^*$ is not feasible because $\Pi^* d_{t_i} = 0$, $i = 1, \ldots, k' - 1$, which is contrary to the assumption that $\Pi^* \in S_{th}$. Therefore, $d_t, d_{t_1}, \ldots, d_{t_{k'-1}}$ must be linearly independent. Moreover, $d_t$ is a minimum dependence vector of $\Pi^*$ because $\Pi^* \in S_{th}$. So $d_t, d_{t_1}, \ldots, d_{t_{k'-1}}$ are minimum dependence vectors of $\Pi^*$ due to the fact that $\Pi^* d_t = \Pi^* d_{t_1} = \cdots = \Pi^* d_{t_{k'-1}}$.

Proof of Statement b): At this point, some additional notations are needed. Let $F_1 = surf\{A_{t_1}, A_{t_2}, \ldots, A_{t_{k_1}}\}$ and $F_2 = surf\{A_{h_1}, A_{h_2}, \ldots, A_{h_{k_2}}\}$ be boundary surfaces of $J$, $\mathcal{F}_r = \{x_1 - x_2 : x_1, x_2 \in F_r\}$, $r = 1, 2$, $N_1 = sp\{A_{t_1}, \ldots, A_{t_{k_1}}\}$, $N_2 = sp\{A_{h_1}, \ldots, A_{h_{k_2}}\}$ and $B = N_1 \cap N_2$. $N_r$ is the norm space of $J$ associated with $F_r$, $r = 1, 2$, $B$ is a norm intersection of $J$, and $\mathcal{F}_1$ and $\mathcal{F}_2$ are the solution spaces of (a.7a) and (a.7b) as follows, respectively.

$$\text{(a)}\ \begin{cases} A_{t_1} y = 0 \\ \quad\vdots \\ A_{t_{k_1}} y = 0 \end{cases} \qquad \text{(b)}\ \begin{cases} A_{h_1} y = 0 \\ \quad\vdots \\ A_{h_{k_2}} y = 0. \end{cases} \tag{a.7}$$

Consider the projection vectors $p_h, p_{h_1}, \ldots, p_{h_{n-k'}}$ of (3.8b) and, without loss of generality, let $p_{h_i} = p_i$, $i = 1, \ldots, n - k'$, and $h > n - k'$. By definition of projection vectors, $p_i$ is the difference of two extreme points of $J$, i.e., $p_i = e_{1i} - e_{2i}$ for some extreme points $e_{1i}$ and $e_{2i}$, where $e_{1i}$ is the head of $p_i$ and $e_{2i}$ is the tail of $p_i$, $i = h, 1, \ldots, n - k'$. Let $H_r$, $r = 1, 2$, denote the convex hull of the extreme points $e_{ri}$, $i = h, 1, \ldots, n - k'$. In other words, $H_1$ and $H_2$ are the convex hulls of all the heads and tails of projection vectors $p_h, p_1, \ldots, p_{n-k'}$, respectively. Three facts are proved next.

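First, it is shown that $\Pi^*$ is perpendicular to both $H_1$ and $H_2$, i.e., $\Pi^*(x_1 - x_2) = 0$, $x_1, x_2 \in H_r$, $r = 1, 2$. Because $\Pi^* e_{1i} = \max\{\Pi^* x : x \in R\}$ and $\Pi^* e_{2i} = \min\{\Pi^* x : x \in R\}$, $i = h, 1, \ldots, n - k'$, $\Pi^* e_{1h} = \Pi^* e_{1i}$ and $\Pi^* e_{2h} = \Pi^* e_{2i}$, $i = 1, \ldots, n - k'$. If $x \in H_r$, $r = 1, 2$, then it can be expressed as a convex combination of $e_{ri}$, $i = h, 1, \ldots, n - k'$, i.e., $x = \lambda_{rh} e_{rh} + \lambda_{r1} e_{r1} + \cdots + \lambda_{r(n-k')} e_{r(n-k')}$ where $\lambda_{ri}$, $i = h, 1, \ldots, n - k'$, are nonnegative and $\lambda_{rh} + \sum_{i=1}^{n-k'} \lambda_{ri} = 1$ [14, p. 37]. Thus,

$$\Pi^* x = \lambda_{rh} \Pi^* e_{rh} + \lambda_{r1} \Pi^* e_{r1} + \cdots + \lambda_{r(n-k')} \Pi^* e_{r(n-k')} = \Pi^* e_{rh} \sum_i \lambda_{ri} = \Pi^* e_{rh}.$$

Therefore, for any two points $x_1, x_2 \in H_r$, $\Pi^* x_1 = \Pi^* x_2 = \Pi^* e_{rh}$, i.e., $\Pi^*(x_1 - x_2) = 0$, which means that $\Pi^*$ is perpendicular to both $H_1$ and $H_2$.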


Second, it is shown that $H_r$ is contained in some boundary surface of $J$, i.e., there exist two boundary surfaces $F_1$ and $F_2$ of $J$ such that $H_r \subseteq F_r$, $r = 1, 2$. Notice that for any $x \in R$ and any row vector $\Pi$, if $\Pi x = \min\{\Pi x : x \in R\}$ or $\Pi x = \max\{\Pi x : x \in R\}$, then $x$ is on some boundary surface of $J$. As mentioned before, for any $x \in H_1$, $\Pi^* x = \Pi^* e_{1h} = \max\{\Pi^* x : x \in R\}$. So $H_1$ is contained in some boundary surface of $J$. Similarly, because for all $x \in H_2$, $\Pi^* x = \Pi^* e_{2h} = \min\{\Pi^* x : x \in R\}$, $H_2$ is contained in some boundary surface of $J$. Let $F_r$ be the boundary surface of $J$ with the largest dimension which contains $H_r$ and is perpendicular to $\Pi^*$, i.e., $H_r \subseteq F_r$ and $\Pi^*(x_1 - x_2) = 0$, $x_1, x_2 \in F_r$, $r = 1, 2$.

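Third, it is shown that $\dim\{\mathcal{F}_1 \cup \mathcal{F}_2\} \ge n - k'$. Now, $p_h - p_i = (e_{1h} - e_{2h}) - (e_{1i} - e_{2i}) = (e_{1h} - e_{1i}) - (e_{2h} - e_{2i})$, $i = 1, \ldots, n - k'$, i.e., $p_h - p_i$ is a linear combination of $e_{1h} - e_{1i}$ and $e_{2h} - e_{2i}$, $i = 1, \ldots, n - k'$. By assumption, $\dim\{p_h - p_i : i = 1, \ldots, n - k'\} = n - k'$, so $\dim\{e_{1h} - e_{1i},\ e_{2h} - e_{2i} : i = 1, \ldots, n - k'\} \ge n - k'$. Let $\mathcal{H}_r = \{x_1 - x_2 : x_1, x_2 \in H_r\}$, $r = 1, 2$; then $\dim\{\mathcal{H}_1 \cup \mathcal{H}_2\} = \dim\{e_{1h} - e_{1i},\ e_{2h} - e_{2i} : i = 1, \ldots, n - k'\} \ge n - k'$. Because $F_r \supseteq H_r$, it follows that $\mathcal{F}_r \supseteq \mathcal{H}_r$, $r = 1, 2$. This means that $\dim\{\mathcal{F}_1 \cup \mathcal{F}_2\} \ge \dim\{\mathcal{H}_1 \cup \mathcal{H}_2\} \ge n - k'$.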

So far, the following statement has been proven. There exist two boundary surfaces $F_1$ and $F_2$ of $J$ to which $\Pi^*$ is perpendicular and $\dim\{\mathcal{F}_1 \cup \mathcal{F}_2\} \ge n - k'$. Next, it is shown that $\Pi^*$ belongs to $B$, the intersection of the norm spaces of $F_1$ and $F_2$, and $\dim\{B\} = k \le k'$.

Consider (a.7a). By [18], its row space $N_1$ and the solution space $\mathcal{F}_1$ are orthogonal complements of each other, i.e., $N_1 \cap \mathcal{F}_1 = \{\bar{0}\}$, $N_1 \cup \mathcal{F}_1 = R^n$ and $X \cdot x = 0$, $X \in N_1$ and $x \in \mathcal{F}_1$ ($X$ and $x$ are row and column vectors, respectively). Because $\Pi^*$ is perpendicular to $F_1$, i.e., $\Pi^*(x_1 - x_2) = 0$, $x_1 - x_2 \in \mathcal{F}_1$ (or $x_1, x_2 \in F_1$), it must be in $N_1$, the norm space of $F_1$. Vector $\Pi^*$ is perpendicular to both $F_1$ and $F_2$, which means that it belongs to both $N_1$ and $N_2$. Therefore, $\Pi^*$ must belong to $B = N_1 \cap N_2$, i.e., $\Pi^* = \sum_{i=1}^{k} \alpha_i B_i$ where $\{B_1, \ldots, B_k\}$ is a basis for $B$ and $\alpha_i$, $i = 1, \ldots, k$, are constants.

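Because $\mathcal{F}_r$ is the orthogonal complement of $N_r$, the orthogonal complement of $N_1 \cap N_2$ is $\mathcal{F}_1 \cup \mathcal{F}_2$ [18], which means that

$$k = \dim\{N_1 \cap N_2\} = n - \dim\{\mathcal{F}_1 \cup \mathcal{F}_2\}.$$

Because $\dim\{\mathcal{F}_1 \cup \mathcal{F}_2\} \ge n - k'$, $k \le n - (n - k') = k'$. So $k \le k'$.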

Now, it has been proven that there exists a $k$-dimensional norm space $B$ such that (3.9) holds and $k \le k'$. To facilitate the understanding of the proof of statement b) of Lemma 6, the most difficult one in this paper, an example is provided next.

Example  a.1:  Consider  the  algorithm  of  Example  2.1.  As 
mentioned  in  Example  3.5,  one  extreme  direction  is  identified 
as  E*  =  [0, 1]  which  satisfies  the  following  equations: 

(a)  E(5i  -  3i)  =  0  (b)  E(?3-Pj)  =  0  (a.8) 

Equation  (a.8b)  corresponds  to  (3.8b).  As  indicated  in  Fig.  1. 
p3  =s  ei  -  e*  and  Pj  =  -  ej.  The  sets  of  heads  and  tails 




are $\{\bar e_1\}$ and $\{\bar e_4, \bar e_3\}$, respectively. In this case, $H_1 = \{\bar e_1\}$
and $H_2$ is the line segment from $\bar e_3$ to $\bar e_4$. Clearly, $\Pi^e$ is
perpendicular to both $H_1$ and $H_2$; $H_1$ is contained in the
zero-dimensional boundary surface $F_1 = \mathrm{surf}\{A_1, A_3\}$ as
shown in Fig. 1 and $H_2$ is contained in the one-dimensional
boundary surface $F_2 = \mathrm{surf}\{A_2\}$ as shown in Fig. 1; $\Pi^e$ is
perpendicular to both $F_1$ and $F_2$; $\dim\{F_1 \cup F_2\} = 1 = n - k'$
($n = 2$, $k' = 1$); $\Pi^e$ belongs to the norm intersection $B =
\mathrm{sp}\{A_1, A_3\} \cap \mathrm{sp}\{A_2\} = R^2 \cap \mathrm{sp}\{A_2\} = \mathrm{sp}\{A_2\}$ and
$\dim\{B\} = k = k' = 1$. End of example.

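Proof of Statement c): Let
$B = \begin{bmatrix} B_1 \\ \vdots \\ B_k \end{bmatrix}$,
$H' = [\bar d_t - \bar d_{t_1}, \cdots, \bar d_t - \bar d_{t_{k'-1}}]$
and $\alpha^T = [\alpha_1, \cdots, \alpha_k]$. If $\Pi$ in (3.8a) is
substituted by $\alpha^T B$, then $\alpha$ can be found by solving equation

$\alpha^T B H' = \bar 0.$  (a.9)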

$\Pi^e \ne \bar 0$ is the only linearly independent solution satisfying
(3.8). Equation (3.8) is equivalent to (3.9) and (a.9), i.e.,
the vector satisfying (3.9) and (a.9) must satisfy (3.8) and
vice versa. If $\mathrm{rank}(BH') > k - 1$, then $\alpha = \bar 0$, which
means $\Pi^e = \bar 0$. If $\mathrm{rank}(BH') < k - 1$, then there are at
least two linearly independent solutions $\alpha_1$ and $\alpha_2$ of (a.9).
Then it can be shown that there are at least two linearly
independent solutions $\alpha_1^T B$ and $\alpha_2^T B$ of (3.9) and (a.9), or
equivalently (3.8) [30]. So far, it has been proven that if
$\mathrm{rank}(BH') > k - 1$, then $\Pi^e = \bar 0$ and if $\mathrm{rank}(BH') < k - 1$,
then there are at least two linearly independent solutions of
(3.8). In both cases, the original assumption is contradicted.
So $\mathrm{rank}(BH')$ must be $k - 1$.

Proof of Statement d): Because $\mathrm{rank}(BH') = k - 1$,
there are only $k - 1$ linearly independent column vectors in
$B[\bar d_t - \bar d_{t_1}, \cdots, \bar d_t - \bar d_{t_{k'-1}}]$. Without loss of generality, let us
assume that the first $k - 1$ columns are linearly independent.
So, only $k$ linearly independent vectors $\bar d_t, \bar d_{t_1}, \cdots, \bar d_{t_{k-1}}$ are
needed to determine $\alpha$. In other words, (a.9) is equivalent
to (3.10). So $\Pi^e$ can be obtained equivalently by (3.9) and
(3.10). □

Proof of Corollary 5.1: Clearly, if $J$ is defined by (5.1),
then any norm vector has the form $E_i$ or $-E_i$, $i \in \{1, \cdots, n\}$.
Therefore, $\{E_{r_1}, \cdots, E_{r_k}\}$, where $r_1, \cdots, r_k \in \{1, \cdots, n\}$,
can be chosen as the basis of any $k$-dimensional norm
intersection of $J$. Let $\Pi^e$ be an extreme direction of $S_{th}$ for some
values of $t$ and $h$. According to Lemmas 5 and 6, $\Pi^e$ can be
obtained from (3.9) and (3.10). Equation (3.10) is equivalent
to


a *j  {Bdtl -  Bfa ,  •  •  • . 4t*_, ,  dt]}  =  5.  (a.10) 


As  mentioned  before.  B  = 


where  ru---,rk  €  {1, 


,  n}.  Let  Dt  denote  D(ti  ■  ■  -tk-vt/ri  •  ■  -rk).  With  lit¬ 


tle  thought  it  can  be  seen  that  B[d\,  ■  ■  ■  ,dk~i,dt]  =  Dt- 
Next,  it  is  shown  that  V  =  01. D*  is  a  solution  of  (a.10). 
If  [at, •••.a*]  is  substituted  by  V,  the  left  side  of  (a.10) 

becomes  I 


3lD~l Bdtl  -  01.  (a.11) 


Let $\det M$ denote the determinant of a square matrix $M$.
Notice that $B\bar d_t$ is the last column of $D_t$, $D_t^{-1} = D^*/(\det D_t)$
(where $D^*$ is the adjugate [18, p. 170] matrix of $D_t$) and
$D^* B \bar d_t = [0, \cdots, 0, \det D_t]^T$. So (a.11) becomes

/3TpjT-/3T  =  0. 

Therefore,  V  is  a  solution  of  (a.10).  By  (3.9),  II*  = 
[ai,  --,Qfc]5  =  VB  =  Tl{ti---th.itlTi---rk).  So  II*  € 
C%.  By  Lemmas  1-4,  IP  is  one  of  the  extreme  directions. 
Therefore,  11°  €  C£.  O 


List  of  Symbols 

A: index constraint matrix; see Assumption 2.1.

$A_i$: row vector with n components; the ith row of matrix A.
a: number of rows of index constraint matrix A; see
Assumption 2.1.

B: vector space; norm intersection; see Definition 2.2.

B: a matrix.

$B_i$: row vector with n components; the ith row of matrix B.
b: size vector (column); see Assumption 2.1.

C°:  candidate  set  for  (J,  D)  constructed  by  Procedure  3.1. 
C£:  candidate  set  for  {J,  D)  constructed  by  Procedure  5.1 
where  J  is  hyperparallelepiped. 

D: dependence matrix with n rows and m columns; see
Definition 2.1 (4).

$\bar d_i$: dependence (column) vector with n components; see
Definition 2.1 (4).

dim{Π}: the number of linearly independent vectors in the
vector space (or set) Π.

$E_i$: row vector with n components whose entries are zero
except that the ith entry is one.
$\bar e$: column vector; extreme point of index set J.

F: set; boundary surface of J; see Definition 2.2.
f: objective function of Problem 2.1.

$H_1$: convex hull of all heads of projection vectors in (3.8b).
$H_2$: convex hull of all tails of projection vectors in (3.8b).
I: identity matrix.

R: set of real numbers.

J: index set; see Definition 2.1 (1).

$\bar j$: column vector; index point; see Definition 2.1 (1).
m: number of dependence vectors in D; see Definition 2.1
(4).

N: vector space; norm space; see Definition 2.2.
N: set of nonnegative integers.
$N^+$: set of positive integers.

n: number of components of index points in J; see
Definition 2.1 (1).

P: set of projection vectors; see Definition 2.6.
$\bar p$: projection vector; see Definition 2.6.

Q: set of rational numbers.

q: number of projection vectors in P; see Definition 2.6.

R: convex hull of the index set J; see Assumption 2.1.
rank(A): rank of matrix A.

S: the solution space of Problem 2.1.

$S_{th}$: subset of S where $\bar d_t$ is minimum and $\bar p_h$ is maximum;
see (3.6).




$\mathrm{sp}\{\bar v_1, \cdots, \bar v_k\}$: the vector space spanned by vectors
$\bar v_1, \cdots, \bar v_k$.

$\mathrm{surf}\{A_{r_1}, \cdots, A_{r_k}\}$: boundary surface $\{\bar x : A_{r_i} \bar x = b_{r_i},
\bar x \in R^n, i = 1, \cdots, k\}$; see Definition 2.2.
$\bar x$: a real column vector.

Z: set of integers.

$\Pi$: row vector; linear schedule vector; see Definition 2.5.
$\Pi^o$: row vector; optimal solution of Problem 2.1.

$\Pi^e$: row vector; the extreme direction satisfying (3.8).
$\Pi_{th}$: row vector; optimal solution of linear program (4.3).
$\sigma$: a schedule (function); see Definition 2.3.

$\sigma_\Pi$: linear schedule; see Definition 2.5.

$\emptyset$: empty set.

$\bar 1$: a column or row vector whose entries are all 1.

$\bar 0$: a column or row vector whose entries are all 0.

|S|: cardinality of set S.

REFERENCES 

[1] R. M. Karp, R. E. Miller, and S. Winograd, "The organization of
computations for uniform recurrence equations," J. ACM, vol. 14, no. 3,
pp. 563-590, July 1967.

[2] D. I. Moldovan and J. A. B. Fortes, "Partitioning and mapping
algorithms into fixed size systolic arrays," IEEE Trans. Comput., vol. C-35,
pp. 1-12, Jan. 1986.

[3] P. R. Cappello and K. Steiglitz, "Unifying VLSI array designs with
geometric transformations," in Proc. 1983 Int. Conf. Parallel Processing,
1983, pp. 448-457.

[4] P. Quinton, "Automatic synthesis of systolic arrays from uniform
recurrent equations," in Proc. 11th Annu. Symp. Comput. Architecture,
1984, pp. 208-214.

[5] S. K. Rao, "Regular iterative algorithms and their implementations on
processor arrays," Ph.D. dissertation, Stanford Univ., Stanford, CA, Oct.
1985.

[6] M. Chen, "A design methodology for synthesizing parallel algorithms
and architectures," J. Parallel Distributed Comput., pp. 461-491, Dec.
1986.

[7] J.-M. Delosme and I. C. F. Ipsen, "An illustration of a methodology for
the construction of efficient systolic architectures in VLSI," in Proc.
Second Int. Symp. VLSI Technol. Syst. Appl., 1985, pp. 268-273.

[8] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ:
Prentice-Hall, 1987.

[9] C. Guerra and R. Melhem, "Synthesizing non-uniform systolic designs,"
in Proc. 1986 Int. Conf. Parallel Processing, 1986, pp. 765-771.

[10] G.-J. Li and B. W. Wah, "The design of optimal systolic arrays," IEEE
Trans. Comput., vol. C-34, pp. 66-77, Jan. 1985.

[11] M. T. O'Keefe and J. A. B. Fortes, "A comparative study of two
systematic design methodologies for systolic arrays," in Proc. 1986 Int. Conf.
Parallel Processing, 1986, pp. 672-675.

[12] J. A. B. Fortes and F. Parisi-Presicce, "Optimal linear schedules for
the parallel execution of algorithms," in Proc. 1984 Int. Conf. Parallel
Processing, 1984, pp. 322-328.

[13] L. Lamport, "The parallel execution of DO loops," Commun. ACM,
vol. 17, no. 2, pp. 83-93, Feb. 1974.

[14] M. S. Bazaraa and C. M. Shetty, Nonlinear Programming: Theory and
Algorithms. New York: Wiley, 1979.

[15] W. Shang and J. A. B. Fortes, "Independent partitioning of algorithms
with uniform dependencies," in Proc. 1988 Int. Conf. Parallel Processing,
vol. 2, Software, Aug. 1988, pp. 26-33.

[16] W. Shang and J. A. B. Fortes, "Time optimal linear schedules for
algorithms with uniform dependencies," in Proc. Int. Conf. Systolic Arrays,
May 1988, pp. 393-402.

[17] R. T. Rockafellar, Convex Analysis. Princeton, NJ: Princeton Univ.
Press, 1970.

[18] G. Strang, Linear Algebra and its Applications, 2nd ed. New York:
Academic, 1980.

[19] R. Cytron, "Doacross: Beyond vectorization for multiprocessors
(extended abstract)," in Proc. 1986 Int. Conf. Parallel Processing, 1986,
pp. 836-844.

[20] W. L. Miranker and W.-M. Liniger, "Parallel methods for the numerical
integration of ordinary differential equations," Math. Comp., vol. 21,
pp. 303-320, 1967.

[21] R. H. Kuhn, "Optimization and interconnection complexity for: parallel
processors, single stage networks, and decision trees," Ph.D. dissertation,
Dep. Comput. Sci., Rep. 80-1009, Univ. of Illinois, Urbana-Champaign,
IL, 1980.

[22] H. T. Kung and C. E. Leiserson, "Algorithms for VLSI array
processors," in Introduction to VLSI Systems, C. Mead and L. Conway, Eds.
Reading, MA: Addison-Wesley, 1980, section 8.3.

[23] N. Weste, D. J. Burr, and B. D. Ackland, "Dynamic time warp
pattern matching using an integrated multiprocessing array," IEEE Trans.
Comput., vol. C-32, pp. 731-744, Aug. 1983.

[24] D. A. Padua, "Multiprocessors: Discussion of theoretical and practical
problems," Ph.D. dissertation, Univ. of Illinois at Urbana-Champaign,
Rep. UIUCDCS-R-79-990, Nov. 1979.

[25] J.-K. Peir and R. Cytron, "Minimum distance: A method for partitioning
recurrences for multiprocessors," in Proc. 1987 Int. Conf. Parallel
Processing, 1987, pp. 217-225.

[26] F. Irigoin and R. Triolet, "Supernode partitioning," in Proc. Fifteenth
Annu. ACM SIGACT-SIGPLAN Symp. Principles Programming
Languages, Jan. 1988, pp. 319-329.

[27] B. Lisper, "Synthesis of synchronous systems by static scheduling
in space-time," Ph.D. dissertation, TRITA-NA-8701, NADA, KTH,
Stockholm, 1987.

[28] V. Van Dongen, personal communications.

[29] A. Schrijver, Theory of Linear and Integer Programming. New York:
Wiley, 1986.

[30] W. Shang, "Scheduling, partitioning and mapping of uniform
dependence algorithms on processor arrays," Ph.D. dissertation, Purdue Univ.,
W. Lafayette, IN 47907, May 1990.

[31] D. G. Luenberger, Linear and Nonlinear Programming, 2nd ed.
Reading, MA: Addison-Wesley, 1984.


Weijia Shang (S'90) received the B.S. degree in
computer engineering and science from Changsha
Institute of Technology, China, in 1982 and the
M.S. degree in electrical engineering from Purdue
University, West Lafayette, IN, in 1984.

She is currently a Ph.D. student in the School
of Electrical Engineering, Purdue University, West
Lafayette, IN. Her research interests include parallel
processing, computer architecture, scheduling
problems, and processor arrays.

Ms. Shang is a student member of the Association
for Computing Machinery.


Jose A. B. Fortes (S'80-M'83) received the M.S.
and Ph.D. degrees in electrical engineering from
Colorado State University and the University of
Southern California, respectively.

He is an Associate Professor at the School of
Electrical Engineering of Purdue University. His
technical interests are in the areas of parallel
processing and fault-tolerant computing, on which he
has published more than 50 papers. He has been on
the faculty of Purdue University, West Lafayette,
IN, since 1984. From August of 1989 until July of
1990, he was on leave at the National Science Foundation where he was a
Program Director for Microelectronic Systems Architecture.

Dr. Fortes is on the editorial boards of the Journal of Parallel and
Distributed Computing and of the Journal of VLSI Signal Processing and
co-chaired the program committee of the 1990 International Conference on
Application Specific Array Processors (ASAP'90).


REFERENCE  NO.  4 


Shang,  W.  and  Fortes,  J.  A.  B.,  "On  the  Independent  Partitioning  of  Algorithms  with 
Uniform  Data  Dependencies,"  IEEE  Transactions  on  Computers,  Volume  41, 
Number  2,  February  1992,  pp.  190-206. 

Note  -  This  paper  presents  a  technique  for  optimal  partition  of  algorithms  into  blocks 
of  computations  among  which  no  communication  is  required.  The  same  techniques 
can  be  used  for  the  case  when  limited  communication  is  possible.  These  techniques 
are  potentially  useful  to  reallocate  computations  in  multiprocessor  systems  where 
failures  result  in  diminished  connectivity  among  processors. 


190 


IEEE  TRANSACTIONS  ON  COMPUTERS.  VOL.  41,  NO.  2.  FEBRUARY  1992 


Independent  Partitioning  of  Algorithms 
with  Uniform  Dependencies 

Weijia  Shang,  Member,  IEEE,  and  Jose  A  B.  Fortes,  Member,  IEEE 


Abstract: An algorithm can be thought of as a set of
indexed computations (index set) and if one computation uses
data generated by another computation, this data dependence
can be represented by the difference of their indexes (called
dependence vector). Many important algorithms are
characterized by the fact that data dependencies are uniform, i.e., the
values of the dependence vectors are independent of the indexes
of computations. An independent partition of the algorithm is
such that there are no dependencies between computations that
belong to different blocks of the partition. This paper considers
uniform dependence algorithms with arbitrary index sets and
proposes two computationally inexpensive methods to find their
independent partitions. Each method has advantages over the
other one for certain kinds of applications, and they both
outperform previously proposed approaches in terms of computational
complexity and/or optimality. Also, lower and upper bounds are
given for the cardinality of maximal independent partitions. In
MIMD systems, if different blocks of an independent partition
are assigned to different processors, communications between
processors will be minimized to zero. This is significant because
the communications usually dominate the overhead in MIMD
machines.

Index Terms: Data communication, independent algorithm
partition, multiprocessor, nested loops, optimizing compiler,
synchronization.


I.  Introduction 

PARALLEL processing holds the potential for
computational speeds that surpass by far those achievable by
technological advances in sequential computers. This potential
is predicated on two often conflicting assumptions, namely,
that many computations can take place concurrently and that
the time spent in data exchanges between these computations
is small. In order to meet these assumptions, algorithms and/or
programs must be partitioned into computational blocks that
can execute in parallel and have communication requirements
efficiently supported by the target parallel computer. Ideally, it
may be desirable to identify, if at all possible, the independent
computational blocks of a program, i.e., those that require
no data communication between them. This paper describes
two practical and computationally inexpensive approaches to
Manuscript received March 15, 1988; revised March 25, 1991. This work
was supported in part by the National Science Foundation under Grant
DCI-8419745 and in part by the Innovative Science and Technology Office of the
Strategic Defense Initiative Organization and was administered through the
Office of Naval Research under Contracts 00014-85-k-0588 and
00014-88-k-0723.

W. Shang is with the Center for Advanced Computer Studies, University
of Southwestern Louisiana, Lafayette, LA 70504.

J. A. B. Fortes is with the School of Electrical Engineering, Purdue
University, West Lafayette, IN 47907.

IEEE Log Number 9103034.


achieve  this  goal.  They  are  based  on  a  sound  mathematical 
framework  which  yields  optimal  results  for  a  meaningful  class 
of  algorithms  and  they  outperform  approaches  proposed  in 
extant  work. 

The identification of a possible partition of an algorithm or
program can be done by the user, by the analysis phase of an
optimizing compiler, or by the machine at run time [4]. The
techniques proposed in this paper, while usable by a patient
and dedicated programmer, are best suited for an optimizing
compiler. They address the specific problem of identifying
independent partitions of an algorithm with goals that are
similar to those of the earlier works of Padua [13] and Peir,
Gajski and Cytron [17], [16], [15]. The focus of these efforts
is on the optimization of programs consisting mainly of nested
loops with regular data dependencies. The techniques proposed
in those papers are intended to complement many other tools
for the analysis and restructuring of sequential programs for
execution in multiprocessing machines [1], [14], [25], [8], [18].
A related potential application of partitioning techniques is
in the design of algorithmically specialized concurrent VLSI
architectures [10].

In  this  paper,  nested  loop  programs  with  regular  data 
dependencies  are  modeled  as  uniform  dependence  algorithms 
which  resemble  the  uniform  recurrence  equations  considered 
in  [7]  and  the  linear  recurrences  of  [15].  Data  dependencies 
are  represented  as  dependence  vectors  (with  as  many  entries 
as  the  number  of  nested  loops)  that  describe  the  distance 
between  dependent  computations  in  terms  of  loop  indexes  (the 
vectors  are  called  dependence  distance  vectors  in  [15]  and 
are  also  considered  in  [25]  and  [2]  in  a  complemented  form). 
Dependence  vectors  are  collected  in  a  matrix,  the  dependence 
matrix,  which  is  used  in  this  paper  and  in  [13],  [17],  and  [15] 
to  identify  independent  partitions  as  briefly  described  in  the 
following  paragraphs. 

The greatest common divisor method [13], [15] considers,
for each row of the dependence matrix, the greatest common
divisor of the entries in that row. The resulting greatest
common divisors are used to partition the iteration space of
the program (also called the index set) and the cardinality of
the resulting partition is the product of the greatest common
divisors. In addition, an "alignment" method is provided in
[13] which allows in some cases the transformation of
dependencies so that the values of the greatest common divisors
are increased. This approach is simple and computationally
inexpensive. For a given set of dependencies, it yields a unique
independent partition which is not necessarily optimal. In some
cases, when all of the greatest common divisors equal unity,




the  number  of  the  blocks  in  the  partition  is  one,  i.e.,  the  whole 
program. 
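
The greatest common divisor method described above can be sketched in a few lines. The following is only an illustration under stated assumptions, not the formulation of [13] or [15]: the function names `row_gcds` and `gcd_block` are hypothetical, the dependence matrix is an illustrative one, every row gcd is assumed to be nonzero, and no alignment step is attempted.

```python
from functools import reduce
from math import gcd, prod


def row_gcds(D):
    """Greatest common divisor of the entries in each row of D."""
    return [reduce(gcd, row, 0) for row in D]


def gcd_block(j, gcds):
    """Block label of index point j: its coordinates modulo the row gcds.

    Assumption of this sketch: every row gcd is nonzero.
    """
    return tuple(x % g for x, g in zip(j, gcds))


# Illustrative dependence matrix with columns d1 = [3, -3]^T and d2 = [0, 2]^T.
D = [[3, 0],
     [-3, 2]]
gcds = row_gcds(D)        # one gcd per row of D
cardinality = prod(gcds)  # number of blocks the method can separate
```

Two points that differ by a dependence vector differ, in coordinate i, by a multiple of the gcd of row i, so they always receive the same label. For this matrix the second row has gcd one, so only the first coordinate contributes to the partition.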

In the minimum distance method [17], [15], the dependence
matrix is transformed into an upper triangular matrix which is
then used to identify an independent partition. For some
algorithms the cardinality of the partition is the product of the
diagonal elements of the upper triangular matrix. This approach
yields partitions that are better than those obtained through
the greatest common divisor method. However, the
computational complexity of this method is higher (though affordable
according to [15]) and the optimality is not always guaranteed.

Two approaches are proposed in this paper. In the first
approach, called partitioning vector approach, a set of vectors
(defined later in Section III) is derived from the dependence
matrix. These vectors are used to find independent partitions of
uniform dependence algorithms. The block to which a given
computation belongs can be identified by simply computing
the dot products of each of the vectors by the index of the
computation. In the second approach, called Smith normal form
approach, a matrix related to the Smith normal form of the
dependence matrix is used to find independent partitions of
uniform dependence algorithms. The block to which a given
computation belongs can be identified by the product of that
matrix and the index of the computation. Both approaches
provide lower bounds and upper bounds on the cardinality
of the resulting partitions. The first approach gives maximal
partitions for a meaningful class of algorithms and the second
approach yields maximal partitions for any algorithms with
uniform dependence structure. Comparisons of these two
approaches proposed in this paper and the minimum distance
method are provided in Section VI.

The organization of this paper is as follows. Section II
presents basic definitions and notation. Sections III and IV
present the partitioning vector approach. In Section III,
partitioning and separating vectors are defined and three types of
independent algorithm partitions by these vectors are derived.
In Section IV, a procedure to use these vectors to find an
independent algorithm partition is presented and sufficient
conditions for the resulting partition to be maximal are discussed.
Section V introduces the notion of Smith normal form and
presents a procedure to find independent partitions based on
this notion. Section VI compares the methods proposed in this
paper and the minimum distance method in [15]. Section VII
summarizes conclusions and points out some future work.

II.  Basic  Definitions  and  Notation 

Throughout this paper, sets, matrices, and row vectors are
denoted by capital letters, column vectors are represented by
lower-case symbols with an overbar and scalars correspond to
lower-case letters. The transposes of a vector $\bar v$ and a matrix
$M$ are denoted $\bar v^T$ and $M^T$, respectively. The symbol $E_i$
denotes the row vector whose entries are all zeros except that
the ith entry is equal to unity. The vector $\bar 1$ (or $\bar 0$) denotes
the row vector or column vector whose entries are all ones
(or zeros). The dimensions of $\bar 1$ and $\bar 0$ and whether they
denote row or column vectors are implied by the context in
which they are used. The vector space spanned by a set of
vectors $W = \{\bar v_1, \bar v_2, \cdots, \bar v_k\}$ is denoted
$\mathrm{sp}\{\bar v_1, \bar v_2, \cdots, \bar v_k\} = \mathrm{sp}\{W\}$.
The symbol $I$ denotes the identity matrix. The rank of
a matrix $A$ is denoted $\mathrm{rank}(A)$ and the determinant of matrix
$A$ is represented by $\det A$. The set of real numbers and the
set of integers are denoted $R$ and $Z$, respectively. The set
of nonnegative integers and the set of positive integers are
denoted $N$ and $N^+$, respectively. The empty set is denoted $\emptyset$
and the notation $A - B$ denotes the set $\{x : x \in A, x \notin B\}$
where $A$ and $B$ are sets. The notation $|P|$ means the cardinality
of set $P$ and $|c|$ represents the absolute value of scalar $c$. Let
$a, b, c, d \in Z$ and $a > 0$; the notation $a \mid b$ means "a divides
b", i.e., $b = ca$; and $b (\mathrm{mod}\ a)$ denotes the modulo operation,
i.e., $b (\mathrm{mod}\ a) = d$ if and only if $a \mid (b - d)$, $0 \le d < a$. As a
final remark, if the element $a$ belongs to a set $W$, the notation
$a \in W$ is used and this notation is also used to indicate that
a column vector $\bar m_j$ (or a row vector $M_i$) is a column (row)
of a matrix $M$, i.e., $\bar m_j \in M$ ($M_i \in M$) means $\bar m_j$ ($M_i$) is a
column (row) vector of matrix $M$.

The algorithms of interest in this paper are the so-called
uniform dependence algorithms defined as follows.

Definition 2.1 (Uniform dependence algorithm): A uniform
dependence algorithm is an algorithm that can be described
by an equation of the form

$v(\bar j) = f_{\bar j}(v(\bar j - \bar d_1), v(\bar j - \bar d_2), \cdots, v(\bar j - \bar d_m))$  (2.1)

where

1) $\bar j \in J \subset Z^n$ is an index point (column vector), $J$ is the
index set of the algorithm and $n \in N^+$ is the number
of components of $\bar j$;

2) $f_{\bar j}$ is the computation indexed by $\bar j$, i.e., a single-valued
function computed "at point $\bar j$" in a single unit of time;

3) $v(\bar j)$ is the value computed "at $\bar j$," i.e., the result of
computing the right-hand side of (2.1) and

4) $\bar d_i \in Z^n$, $i = 1, \cdots, m$, $m \in N$ are dependence vectors,
also called dependencies, which are constant (i.e.,
independent of $\bar j \in J$); the matrix $D = [\bar d_1, \cdots, \bar d_m]$ is called
the dependence matrix and $\mathrm{rank}(D) \le \min\{n, m\}$ is
denoted by $m'$.

The class of uniform dependence algorithms is a simple
extension of the class of computations described by uniform
recurrence equations [7]. The main difference is that
uniform dependence algorithms allow for different functions to
be computed (in a unit of time) at different points of the
index set. From a practical viewpoint, uniform dependence
algorithms can be easily related to programs where 1) a
single statement appears in the body of a multiply nested
loop and 2) the indexes of the variable in the left-hand side
of the statement differ by a constant from the corresponding
indexes in all references to the same variable in the
right-hand side. Alternative computations can occur in each iteration
as a result of a single conditional statement as long as data
dependencies do not change. Nested loop programs with
multiple statements can also use the techniques of this paper
together with the alignment method discussed in [13] and [15].
For the purpose of this paper, only structural information of the
algorithm, i.e., the index set J and the dependence matrix D, is
needed. Other information such as what computations occur at



different points and where and when input/output of variables
takes place can be ignored. Therefore, a uniform dependence
algorithm with index set J and dependence matrix D is hereon
characterized simply by the pair (J, D). Also, as in Definition
2.1, the letters n, m, and m' always denote the dimension of
index points in J, the number of dependence vectors and the
rank of the dependence matrix D, respectively.
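
As a hedged illustration of the correspondence between nested loops and the pair (J, D), the sketch below executes a doubly nested loop whose single statement reads the same variable at two constant offsets. The loop bound `s`, the dependence vectors, the summation used as the computation and the boundary value of 1 are all assumptions made for this example only.

```python
s = 8                          # assumed loop bound
D_columns = [(3, -3), (0, 2)]  # assumed dependence vectors d1 and d2

v = {}                         # v[j] holds the value computed at index point j


def value(j):
    """Value at j; points outside the index set are given the assumed value 1."""
    return v.get(j, 1)


# Index set J = {(j1, j2) : 0 <= j1, j2 <= s}, visited in lexicographic order.
for j1 in range(s + 1):
    for j2 in range(s + 1):
        # Single statement: v(j) is a function of v(j - d1) and v(j - d2).
        operands = [value((j1 - d[0], j2 - d[1])) for d in D_columns]
        v[(j1, j2)] = sum(operands)
```

Both offsets point to iterations that the lexicographic loop order has already executed, so every operand inside J is available when it is read. Only the index set and the two offsets matter for partitioning; the particular function applied to the operands does not.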

Definition 2.2 (Algorithm dependence graph and
connectivity): The dependence graph of an algorithm (J, D) is the
nondirected graph (J, E) where J is the set of nodes of the
graph and $E = \{(\bar j', \bar j) : \bar j - \bar j' = \bar d$ or $\bar j' - \bar j = \bar d$, $\bar d \in D$,
$\bar j', \bar j \in J\}$ is the set of edges. Two index points $\bar j, \bar j'$ are
connected if there exist index points $\bar j_1, \cdots, \bar j_l \in J$ such that
$(\bar j, \bar j_1), (\bar j_1, \bar j_2), \cdots, (\bar j_{l-1}, \bar j_l), (\bar j_l, \bar j') \in E$.

Definition 2.3 (Independent partition, maximal independent
partition and partitionability): Given an algorithm (J, D)
and the corresponding dependence graph (J, E), let $P =
\{J_1, \cdots, J_q\}$, $q \in N^+$, be a partition of J. If for any arbitrary
points $\bar j_1 \in J_i$ and $\bar j_2 \in J_l$, $i \ne l$ and $0 < i, l \le q$,
$(\bar j_1, \bar j_2) \notin E$, then P is an independent partition of the algorithm (J, D).
The sets $J_i$, $i = 1, \cdots, q$, are called blocks of partition P. For
an independent partition P, if any two arbitrary points $\bar j, \bar j' \in
J_i$, $i = 1, \cdots, q$, are connected in the dependence graph, then P
is the maximal independent partition of (J, D) and is denoted
$P_{\max}$. The cardinality of the maximal independent partition
$|P_{\max}|$ is referred to as the partitionability of the algorithm
(J, D).

Informally, for the algorithm dependence graph, each index
point corresponds to a node and if a computation depends
directly on another one, then there is an edge between their
index points. For Definition 2.3, an independent partition is
such that there are no dependencies between computations
which belong to different blocks of the partition. In graph
theoretical terms, each block of an independent partition of
(J, D) corresponds to a component of its dependence graph
(J, E).
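
Definitions 2.2 and 2.3 suggest a direct, brute-force way to obtain the maximal independent partition of a small algorithm: enumerate the components of the dependence graph. The sketch below does this by breadth-first search. It is an illustration only, not one of the methods proposed in this paper; the function name and the one-dimensional test algorithm are assumptions.

```python
from collections import deque


def maximal_independent_partition(J, D_columns):
    """Blocks of the maximal independent partition, found as the connected
    components of the dependence graph (J, E) of Definition 2.2."""
    points = set(J)
    seen, blocks = set(), []
    for start in sorted(points):
        if start in seen:
            continue
        block, queue = [], deque([start])
        seen.add(start)
        while queue:
            j = queue.popleft()
            block.append(j)
            # An edge joins j and j' whenever j' = j + d or j' = j - d.
            for d in D_columns:
                for sign in (1, -1):
                    nxt = tuple(a + sign * b for a, b in zip(j, d))
                    if nxt in points and nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
        blocks.append(sorted(block))
    return blocks


# Hypothetical one-dimensional algorithm: J = {0, ..., 5}, single dependence d = [2].
J = [(i,) for i in range(6)]
blocks = maximal_independent_partition(J, [(2,)])
```

For this small case the even and the odd index points form the two blocks, so the partitionability is 2. The search visits every index point, which is why it is useful only as a reference for small index sets.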

Generally speaking, the shape and the size of the index
set influence the partitionability of the algorithm because
of boundary conditions. Consider two algorithms (J, D) and
(J', D') such that $D' = D$, $J = J' \cup \{\bar j\}$ and $\bar j \notin J'$, i.e., they
differ only in the size of the index sets. The corresponding
dependence graphs (J, E) and (J', E') can be such that
$\bar j_1, \bar j_2 \in J'$ are not connected in (J', E') but are connected in
(J, E) because it is possible that $E = E' \cup \{(\bar j_1, \bar j), (\bar j, \bar j_2)\}$.
In other words, $\bar j_1$ and $\bar j_2$ can belong to different blocks of the
maximal independent partition of (J', D') but belong to the
same block of the maximal independent partition of (J, D).
The following example illustrates this concept.

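Example 2.1: Consider algorithms $(J, D)$ and $(J', D)$, where

$$D = \begin{bmatrix} 3 & 0 \\ -3 & 2 \end{bmatrix}, \qquad J = \{[j_1, j_2]^T : 0 \le j_1, j_2 \le s,\ s \in N^+\}$$

and

$$J' = \left\{ j : \begin{bmatrix} -1 & 1 \\ 0 & -1 \\ 1 & 0 \end{bmatrix} j \le \begin{bmatrix} 0 \\ 0 \\ s \end{bmatrix},\ s \in N^+ \right\}.$$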

Fig. 1 shows the index sets $J$ and $J'$ where $s = 8$. These two algorithms have the same dependence matrix but different index sets. In $J'$, point $[1, 1]^T$ is not connected to any other points because $[1, 1]^T \pm d_i$, $i = 1, 2$, do not belong to $J'$. However, in $J$, it is connected to $[4, 0]^T \in J'$. End of example.
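
The boundary effect in Example 2.1 can be checked by brute force. The following Python sketch is an illustration only: it assumes the dependence vectors $d_1 = [3, -3]^T$ and $d_2 = [0, 2]^T$, takes $J$ as the square $0 \le j_1, j_2 \le 8$, and takes $J'$ as the subset with $j_2 \le j_1$ (one reading of the example's constraints). The helper names `components` and `block_of` are hypothetical. The code builds the undirected dependence graph and extracts its connected components.

```python
from collections import deque

def components(points, deps):
    """Connected components of the undirected dependence graph on `points`."""
    points = set(points)
    seen, blocks = set(), []
    for start in sorted(points):
        if start in seen:
            continue
        block, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            j = queue.popleft()
            for d in deps:
                for sign in (1, -1):  # an edge links j and j + d in either direction
                    k = (j[0] + sign * d[0], j[1] + sign * d[1])
                    if k in points and k not in seen:
                        seen.add(k)
                        block.add(k)
                        queue.append(k)
        blocks.append(block)
    return blocks

def block_of(point, blocks):
    """Return the block that contains `point`."""
    return next(b for b in blocks if point in b)

s = 8
deps = [(3, -3), (0, 2)]                      # assumed columns of D
J = [(a, b) for a in range(s + 1) for b in range(s + 1)]
J_prime = [(a, b) for (a, b) in J if b <= a]  # assumed shape of J'

block_in_J = block_of((1, 1), components(J, deps))
block_in_J_prime = block_of((1, 1), components(J_prime, deps))
```

Under these assumptions, `block_in_J_prime` contains only the point (1, 1), while `block_in_J` also contains (1, 3) and (4, 0).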

The  dependence  of  the  partitionability  of  an  algorithm 
$(J, D)$ on the shape and size of its index set $J$ is a complicated
issue  and  has  practical  implications.  For  example,  in  many 
programs,  the  loop  bounds  are  not  known  at  compile  time 
and  partitions  must  be  identified  which  are  independent  of 
the  size  and  shape  of  the  index  set  and  based  solely  on  data 
dependencies.  To  concentrate  on  the  relationship  between  the 
dependence  structure  and  the  partitionability  of  the  algorithm, 
the  following  concepts  are  introduced. 

Definition 2.4 (Pseudo-connectivity): Given an algorithm $(J, D)$, two index points $j, j' \in J$ are pseudo-connected if there exists a vector $\lambda \in Z^m$ such that $j = j' + D\lambda$.

As an example of pseudo-connectivity, in algorithm $(J', D)$ of Example 2.1, point $[1, 1]^T$ is pseudo-connected to $[4, 0]^T$ through point $[1, 3]^T \in (J - J')$.
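
Definition 2.4 reduces pseudo-connectivity to the existence of an integer solution of $D\lambda = j - j'$. A minimal Python sketch for the special case of a nonsingular $2 \times 2$ dependence matrix follows; it uses Cramer's rule with integer arithmetic. The function name `pseudo_connected` and the matrix values (taken as the dependence matrix of Example 2.1, with rows $[3, 0]$ and $[-3, 2]$) are assumptions made for illustration.

```python
def pseudo_connected(j, k, D):
    """True if D * lam = j - k has an integer solution lam.

    D is a nonsingular 2x2 integer matrix given as ((a, b), (c, d)).
    """
    (a, b), (c, d) = D
    det = a * d - b * c
    r1, r2 = j[0] - k[0], j[1] - k[1]
    # Cramer's rule: lam1 = (r1*d - b*r2)/det, lam2 = (a*r2 - c*r1)/det
    return (r1 * d - b * r2) % det == 0 and (a * r2 - c * r1) % det == 0

D = ((3, 0), (-3, 2))   # assumed dependence matrix (columns d1, d2)
connected_40 = pseudo_connected((4, 0), (1, 1), D)
connected_41 = pseudo_connected((4, 1), (1, 1), D)
```

Here `connected_40` is true because $[4,0]^T = [1,1]^T + d_1 + d_2$, whereas `connected_41` is false because the second component of $\lambda$ would be $3/2$.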

Definition 2.5 (Pseudo-independent partition, maximal pseudo-independent partition and pseudo-partitionability): Given an algorithm $(J, D)$, let $P = \{J_1, \ldots, J_q\}$ be a partition of $J$. If any two arbitrary points $j_1 \in J_i \in P$ and $j_2 \in J_l \in P$, $i \ne l$, are not pseudo-connected, then $P$ is a pseudo-independent partition of the algorithm $(J, D)$. If $P$ is a pseudo-independent partition and any two arbitrary points $j, j' \in J_i$, $i = 1, \ldots, q$, are pseudo-connected, then $P$ is the maximal pseudo-independent partition of $(J, D)$ and is denoted $P_{max}$. The cardinality of the maximal pseudo-independent partition $|P_{max}|$ is referred to as the pseudo-partitionability of the algorithm $(J, D)$.

In  many  practical  cases,  e.g.,  when  “while”  loops  are  present 
in  a  program,  it  is  also  convenient  to  consider  algorithms 
whose  index  sets  are  arbitrarily  large  along  one  or  more 
dimensions. The general case, i.e., when this applies to all
dimensions,  is  captured  in  the  following  definition  and  is  also 
considered  in  this  paper. 

Definition 2.6 (Semi-infinite index set): An index set $J$ is semi-infinite if it takes the following form:

$$J = \{j = [j_1, \ldots, j_n]^T : 0 \le j_i < \infty,\ j_i \in N,\ i = 1, \ldots, n\} \qquad (2.2)$$

Example 2.2: The algorithm $(J, D)$, where $D = \begin{bmatrix} 2 & -3 \\ -1 & 2 \end{bmatrix}$ and $J = N^2$, is semi-infinite, i.e., $J = \{j = [j_1, j_2]^T : 0 \le j_1, j_2 < \infty,\ j_1, j_2 \in N\}$. The index set $J$ is partially shown in Fig. 2. The maximal partition $p_{max} = \{J_1, J_2, J_3, J_4\}$ where $J_1 = \{[0,0]^T\}$, $J_2 = \{[1,0]^T\}$, $J_3 = \{[0,1]^T, [2,0]^T\}$ and $J_4 = \{j : j \in (J - \bigcup_{i=1}^{3} J_i)\}$. Index points $j_1 = [0,0]^T$ and $j_2 = [0,1]^T$ are not actually connected in the dependence graph of the algorithm. However, they are pseudo-connected by Definition 2.4 since $j_2 = j_1 + D\lambda$, $\lambda = [3, 2]^T$. Intuitively, $j_1$ and $j_2$ are connected through points $[2,-1]^T$, $[4,-2]^T$, $[6,-3]^T$ and $[3,-1]^T$, which are not in $J$. Partition $p_{max}$ is not a pseudo-independent partition. Since $\det D = 1$, equation $D\lambda = j - j'$ always has an integer solution for $\lambda$. So any two arbitrary points in $J$ are


SHANG AND FORTES: INDEPENDENT PARTITIONING OF ALGORITHMS WITH UNIFORM DEPENDENCIES


Fig. 1. In $J'$, point $[1, 1]^T$ is not connected to any other points. However, in $J$, it is connected to many other points such as point $[4, 0]^T$. (a) Index set $J$. (b) Index set $J'$.

Fig. 2. The maximal independent partition of algorithm of Example 2.2 is $p_{max} = \{J_1, J_2, J_3, J_4\}$. However, there is only one block in the maximal pseudo-independent partition. Pictorially, only the connectivities of points near boundaries of $J$ are influenced.


pseudo-connected. This implies that there is only one pseudo-independent partition $P = \{J\}$, which is also the maximal pseudo-independent partition. End of example.

At this point, some comments are in order. First, by Definitions 2.3 and 2.5, a pseudo-independent partition is also an independent partition regardless of the shape and size of the index set. However, an independent partition is not necessarily a pseudo-independent partition. This is due to the fact that $j_1, j_2 \in J$ are pseudo-connected if they are connected and the reverse is not necessarily true. Second, for practical purposes, it is sufficient and more efficient to identify pseudo-independent partitions than independent partitions for the reasons explained next. Blocks of independent partitions that are not blocks of a pseudo-independent partition and contain only a few index points, such as $J_1$, $J_2$, and $J_3$ in Example 2.2 (hereon called boundary blocks), always occur at or near the boundaries of an index set. This can be shown for the general case when $J$ is semi-infinite. In fact, according to Lemma 3 in [7], there always exists a point $p = [p_1, p_2, \ldots, p_n]^T \in J$ such that if any arbitrary points $j = [j_1, j_2, \ldots, j_n]^T \in J$ and $j' = [j'_1, j'_2, \ldots, j'_n]^T \in J$ are beyond $p$ (i.e., $j_i > p_i$ and $j'_i > p_i$, $i = 1, \ldots, n$), then $j$ and $j'$ are connected in the dependence graph if and only if they are pseudo-connected. Boundary blocks are typically such that their individual cardinalities are very small in relation to the sizes of the algorithm and pseudo-independent blocks. As a consequence, little additional speedup can result from executing boundary blocks concurrently with other blocks. Moreover, assigning small boundary blocks and other large pseudo-independent blocks to different processors of a multiprocessor can cause an unbalanced load distribution and inefficient system operation. In addition, as pointed out before, when index sets are known only at run time, it is not possible to determine the boundary blocks. Finally, many algorithms are such that they have the same partitionability and pseudo-partitionability. For all of the above reasons, this paper considers hereon only the problem of identifying pseudo-independent partitions of an algorithm.

III. Partitioning Vector Approach: Basic Results

In  this  section  and  in  Section  IV  the  partitioning  vector 
approach  is  presented.  In  this  approach,  independent  algorithm 
partitions  are  determined  by  two  types  of  vectors  called 
partitioning  vectors  and  separating  vectors.  Together  with 
some  auxiliary  terminology  they  are  introduced  in  Definitions 
3.1  and  3.3.  These  definitions  are  followed  by  a  theorem  and  an 
example  which  make  clear  the  relation  between  these  vectors 
and  independent  algorithm  partitions. 


IEEE TRANSACTIONS ON COMPUTERS, VOL. 41, NO. 2, FEBRUARY 1992


Definition 3.1 (Partitioning vector, determining vector, equal partitioning vector and algorithm coefficient): Given an algorithm $(J, D)$, $\Pi$ (footnote 1) $= [\pi_1, \pi_2, \ldots, \pi_n] \in Z^{1 \times n}$ is a partitioning vector of $(J, D)$ if and only if it satisfies the following conditions.

1) $\gcd(\pi_1, \pi_2, \ldots, \pi_n) = 1$ (footnote 2).

2) There exists a set of $m' = \mathrm{rank}(D)$ linearly independent dependence vectors $d_{t_1}, d_{t_2}, \ldots, d_{t_{m'}}$ such that

$$\Pi d_{t_1} = \cdots = \Pi d_{t_{m'}} = \mathrm{disp}\Pi > 0. \qquad (3.1)$$

The dependence vectors $d_{t_1}, \ldots, d_{t_{m'}}$ are called the determining vectors of $\Pi$ and $D_c = [d_{t_1}, \ldots, d_{t_{m'}}]$ is called the determining matrix of $\Pi$. The positive integer $\mathrm{disp}\Pi$ is called the displacement of vector $\Pi$. If $\Pi d_i \pmod{\mathrm{disp}\Pi} = 0$, $i = 1, \ldots, m$, then $\Pi$ is called an equal partitioning vector of $(J, D)$. The constant $\alpha = \gcd(\Pi d_1, \ldots, \Pi d_m)$ is called the algorithm coefficient.

For a given partitioning vector the set of determining vectors is not necessarily unique and, therefore, $\mathrm{disp}\Pi$ also might not be unique. However, given a partitioning vector and a set of determining vectors, $\mathrm{disp}\Pi$ is unique. Therefore, whenever $\mathrm{disp}\Pi$ is mentioned, it is associated with a particular set of determining vectors.

By Definition 3.1, if $m' = n$, then for each set of determining vectors $d_{t_1}, \ldots, d_{t_{m'}}$, the corresponding partitioning vector $\Pi$ is the unique solution that satisfies conditions 1 and 2 in Definition 3.1 and the following system of linear equations:

$$\begin{aligned} \Pi(d_{t_1} - d_{t_2}) &= 0 \\ \Pi(d_{t_1} - d_{t_3}) &= 0 \\ &\ \,\vdots \\ \Pi(d_{t_1} - d_{t_{m'}}) &= 0. \end{aligned} \qquad (3.2)$$

When $m' < n$, the partitioning vector determined by $m'$ linearly independent dependence vectors $d_{t_1}, \ldots, d_{t_{m'}}$ is not unique and, of course, it belongs to the solution space of (3.2). In the next section, a closed form expression is provided for a partitioning vector as a solution of (3.2).

A partitioning vector $\Pi$ defines a set of hyperplanes $\Pi j \pmod{\alpha} = c$, $c \in Z$, in the index space. Since an index point lies on only one of the hyperplanes, the index set $J$ can be partitioned according to them, i.e., all points $j$ lying on hyperplanes such that, for a fixed $c$, $\Pi j \pmod{\alpha} = c$, belong to the same block of the partition. The following definition states this concept formally.

Definition 3.2 ($\alpha$-partition): Let $\Pi$ be a partitioning vector and $\alpha$ be the algorithm coefficient for $(J, D)$. The partition $P_\alpha = \{J_0, \ldots, J_{\alpha-1}\}$ where $J_i = \{j : j \in J,\ \Pi j \pmod{\alpha} = i\}$, $i = 0, \ldots, \alpha - 1$, is called the $\alpha$-partition of $(J, D)$.

Clearly, $P_\alpha$ is a partition of $J$ and it will be shown in Theorem 3.1 that $P_\alpha$ is also a pseudo-independent partition. For the case where $m' < n$, i.e., $\mathrm{rank}(D) < n$, a necessary condition for two index points $j_1, j_2 \in J$ to be pseudo-connected is that equation $Dx = (j_1 - j_2)$ has at least one real solution $x \in R^m$. This motivates the introduction of the

1 Row vector $\Pi$ should not be confused with the product notation $\prod$.

2 $\gcd(a_1, \ldots, a_n)$ is the greatest common divisor of $a_1, \ldots, a_n$.


following concepts. Let row vector $\Psi_i$ be such that $\Psi_i D = 0$. Clearly, there are $n - m'$ linearly independent such vectors, denoted $\Psi_1, \ldots, \Psi_{n-m'}$, and they define a set of hyperplanes

$$\begin{bmatrix} \Psi_1 \\ \vdots \\ \Psi_{n-m'} \end{bmatrix} j = y, \quad y \in Z^{n-m'}, \qquad (3.3)$$

in the index space. The index set $J$ can be partitioned such that points lying on the same hyperplane belong to the same block of the partition. It will be clear later (Lemma 8.3) that if two index points $j_1, j_2 \in J$ lie on the same hyperplane defined by (3.3), then equation $Dx = (j_1 - j_2)$ has a solution. These concepts are formally defined as follows.

Definition 3.3 (Separating vector and separating matrix): Given an algorithm $(J, D)$, $\Psi_i = [\psi_{i1}, \ldots, \psi_{in}] \in Z^{1 \times n}$ is a separating vector of $(J, D)$ if and only if it satisfies the following conditions.

1) $\gcd(\psi_{i1}, \ldots, \psi_{in}) = 1$.

2) $\Psi_i D = 0$.

Let $\Psi_1, \ldots, \Psi_{n-m'}$ be all the linearly independent separating vectors; the matrix

$$\Psi = \begin{bmatrix} \Psi_1 \\ \vdots \\ \Psi_{n-m'} \end{bmatrix}$$

is called the separating matrix.

A set of $n - m'$ linearly independent separating vectors $\Psi_1, \ldots, \Psi_{n-m'}$ for algorithm $(J, D)$ can be found by solving the equation in condition 2 of Definition 3.3. The following definition indicates how to use these separating vectors to construct a corresponding algorithm partition.
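
One possible way to solve condition 2 of Definition 3.3 is exact Gaussian elimination over the rationals, followed by scaling each basis vector to a primitive integer vector. The Python sketch below is only an illustration under that assumption; the function name `separating_vectors` and the sign convention (first nonzero entry positive) are choices made here, not part of the paper.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def separating_vectors(D):
    """Primitive integer row vectors psi with psi * D = 0.

    D is an n x m integer matrix given as a list of n rows.
    """
    n, m = len(D), len(D[0])
    # psi * D = 0 is equivalent to D^T * psi^T = 0; row-reduce A = D^T (m x n).
    A = [[Fraction(D[r][c]) for r in range(n)] for c in range(m)]
    pivots, row = [], 0
    for col in range(n):
        pr = next((r for r in range(row, m) if A[r][col] != 0), None)
        if pr is None:
            continue
        A[row], A[pr] = A[pr], A[row]
        piv = A[row][col]
        A[row] = [x / piv for x in A[row]]
        for r in range(m):
            if r != row and A[r][col] != 0:
                f = A[r][col]
                A[r] = [x - f * y for x, y in zip(A[r], A[row])]
        pivots.append(col)
        row += 1
        if row == m:
            break
    basis = []
    for free in (c for c in range(n) if c not in pivots):
        v = [Fraction(0)] * n
        v[free] = Fraction(1)
        for r, pc in enumerate(pivots):
            v[pc] = -A[r][free]
        lcm = reduce(lambda a, b: a * b // gcd(a, b), (x.denominator for x in v), 1)
        ints = [int(x * lcm) for x in v]
        g = reduce(gcd, (abs(x) for x in ints))
        ints = [x // g for x in ints]            # gcd of the entries is now 1
        if next(x for x in ints if x != 0) < 0:  # fix the sign for readability
            ints = [-x for x in ints]
        basis.append(ints)
    return basis

psi_single = separating_vectors([[2], [2]])         # D = [d], d = [2, 2]^T
psi_full = separating_vectors([[3, 0], [-3, 2]])    # a full-rank 2 x 2 matrix
```

For the single dependence vector $d = [2, 2]^T$ the sketch returns the one separating vector $[1, -1]$; for a full-rank square matrix it returns an empty list, matching the case $m' = n$.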

Definition 3.4 ($\Psi$-partition): Let $\Psi$ be a separating matrix of algorithm $(J, D)$. The partition $P_\Psi = \{J_{y_1}, \ldots, J_{y_\eta}\}$ of $J$ is called the $\Psi$-partition of algorithm $(J, D)$ if $J_{y_i} = \{j : j \in J,\ \Psi j = y_i\}$, where $y_i \in Z^{(n-m')}$ is called the index of block $J_{y_i}$, $i = 1, \ldots, \eta$.

Clearly, $P_\Psi$ is a partition of $J$. If $m' = n$, then $P_\Psi = \{J\}$ is a trivial partition since the only separating vector is $0$ in this case. As for $P_\alpha$, $P_\Psi$ is actually pseudo-independent, as shown later in Theorem 3.1.

Let $J_y \in P_\Psi$ and consider the subalgorithm $(J_y, D)$. Clearly, if $\alpha > 1$, subalgorithm $(J_y, D)$ can be further partitioned by the partitioning vector $\Pi$. In other words, the index set $J$ can be partitioned by a set of hyperplanes

$$\begin{bmatrix} \Pi j \pmod{\alpha} \\ \Psi j \end{bmatrix} = \begin{bmatrix} y_0 \\ y \end{bmatrix}, \quad y_0 \in \{0, 1, \ldots, \alpha - 1\} \text{ and } y \in Z^{n-m'}, \qquad (3.4)$$

and all points lying on the same hyperplane belong to the same block of the partition. This partition is formally stated next.

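Definition 3.5 ($\alpha\Psi$-partition): Let $\Pi$ be a partitioning vector and $\Psi$ be a separating matrix of algorithm $(J, D)$. The partition $P_{\alpha\Psi} = \{J_{\hat{y}_1}, \ldots, J_{\hat{y}_\nu}\}$ of index set $J$ is called the $\alpha\Psi$-partition if

$$J_{\hat{y}_i} = \left\{ j : j \in J,\ \begin{bmatrix} \Pi j \pmod{\alpha} \\ \Psi j \end{bmatrix} = \hat{y}_i \right\},$$

where $\hat{y}_i = \begin{bmatrix} y_{i0} \\ y_i \end{bmatrix} \in Z^{n-m'+1}$ is called the index of block $J_{\hat{y}_i}$, $i = 1, \ldots, \nu$.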




Partitioning vectors and separating vectors play a very important role in algorithm partition. The next theorem gives some of the motivation for the introduction of these concepts. More specifically, it provides sufficient conditions for two computations to belong to different blocks of an independent partition, in terms of those vectors and the index points associated with the computations. Moreover, it shows that $\alpha$-partitions, $\Psi$-partitions, and $\alpha\Psi$-partitions are all pseudo-independent.

Theorem 3.1: Let $\Pi$ be a partitioning vector, $\alpha$ be the algorithm coefficient, and $\Psi$ be a separating matrix of algorithm $(J, D)$, respectively. The following statements are true:

1) For any two arbitrary points $j_1, j_2 \in J$, if $\Pi j_1 \pmod{\alpha} \ne \Pi j_2 \pmod{\alpha}$ then they are not pseudo-connected. Therefore, $P_\alpha$ is a pseudo-independent partition of $(J, D)$.

2) For any two arbitrary points $j_1, j_2 \in J$, if $\Psi j_1 \ne \Psi j_2$ then they are not pseudo-connected. Therefore, $P_\Psi$ is a pseudo-independent partition of $(J, D)$.

3) $P_{\alpha\Psi}$ is a pseudo-independent partition.

Proof:  Provided  in  Appendix. 

Corollary 3.1: If algorithm $(J, D)$ has an equal partitioning vector $\Pi$, then $j_1, j_2 \in J$ are not pseudo-connected if $\Pi j_1 \pmod{\mathrm{disp}\Pi} \ne \Pi j_2 \pmod{\mathrm{disp}\Pi}$ or $\Psi j_1 \ne \Psi j_2$.

Corollary 3.1 is a particular case of Theorem 3.1. If algorithm $(J, D)$ has an equal partitioning vector $\Pi$, then every $\Pi d_i$ is a multiple of $\mathrm{disp}\Pi$ and the determining vectors attain $\mathrm{disp}\Pi$ exactly, so the algorithm coefficient is $\alpha = \mathrm{disp}\Pi$. By Theorem 3.1, Corollary 3.1 holds.

Example 3.1: Consider algorithm $(J, D)$ where $J = \{[j_1, j_2]^T : 0 \le j_1, j_2 \le s,\ s \in N^+\}$ and $D = [d]$ where $d = [2, 2]^T$. Fig. 3 shows the index set $J$ for $s = 4$. There is only one possible set of determining vectors $\{d\}$. One of the partitioning vectors determined by $d$ is $\Pi = [-1, 2]$. It follows that $\mathrm{disp}\Pi = \Pi d = 2$ and the algorithm coefficient $\alpha = 2$. Consider index points $j_1 = [0,0]^T$ and $j_2 = [1,0]^T$; since $\Pi j_1 \pmod{\alpha} = 0$ and $\Pi j_2 \pmod{\alpha} = 1$, by Theorem 3.1, they are not pseudo-connected. There is only one linearly independent separating vector $\Psi_1 = [1, -1]$ and a separating matrix is $\Psi = [1, -1]$. Again, consider index points $j_1$ and $j_3 = [0, 1]^T$, for which $\Psi j_1 = 0$ and $\Psi j_3 = -1$. By Theorem 3.1, $j_1$ and $j_3$ are not pseudo-connected. In Fig. 3(a) and (b), hyperplanes $\Pi j \pmod{\alpha} = c_1$ and $\Psi j = c_2$ ($c_1, c_2 \in Z$) are drawn, respectively. All the points lying on the same hyperplane $\Pi j \pmod{\alpha} = c_1$ belong to the same block of the $\alpha$-partition and all the points lying on the same hyperplane $\Psi j = c_2$ belong to the same block of the $\Psi$-partition. Fig. 3 shows the $\alpha$-partition, $\Psi$-partition, and $\alpha\Psi$-partition pictorially.

Let $s = 3$; then $P_\alpha = \{J_0, J_1\}$ where

$J_0 = \{[0,0]^T, [0,1]^T, [0,2]^T, [0,3]^T, [2,0]^T, [2,1]^T, [2,2]^T, [2,3]^T\}$ and
$J_1 = \{[1,0]^T, [1,1]^T, [1,2]^T, [1,3]^T, [3,0]^T, [3,1]^T, [3,2]^T, [3,3]^T\}$.

Also $P_\Psi = \{J_{[-3]}, \ldots, J_{[0]}, \ldots, J_{[3]}\}$ where

$J_{[3]} = \{[3,0]^T\}$,
$J_{[2]} = \{[2,0]^T, [3,1]^T\}$,
$J_{[1]} = \{[1,0]^T, [2,1]^T, [3,2]^T\}$,
$J_{[0]} = \{[0,0]^T, [1,1]^T, [2,2]^T, [3,3]^T\}$,
$J_{[-1]} = \{[0,1]^T, [1,2]^T, [2,3]^T\}$,
$J_{[-2]} = \{[0,2]^T, [1,3]^T\}$ and
$J_{[-3]} = \{[0,3]^T\}$.

The partition $P_{\alpha\Psi}$ can be obtained by intersecting $J_i \cap J_{[j]}$, $i = 0, 1$ and $j = -3, \ldots, 3$. Table I lists all blocks of the $\alpha\Psi$-partition and their index points. $|P_{\alpha\Psi}| = 12 < \alpha|P_\Psi| = 14$. Clearly, $P_\Psi$, $P_\alpha$, and $P_{\alpha\Psi}$ are pseudo-independent partitions. In Section IV, it is shown that the $\alpha\Psi$-partition is also the maximal pseudo-independent partition. End of example.
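
The block counts quoted in Example 3.1 can be reproduced with a short script. This is a sketch only: the function name `partitions` and the dictionary-based representation of blocks are illustrative choices, while $\Pi = [-1, 2]$, $\alpha = 2$, $\Psi = [1, -1]$ and $s = 3$ come from the example.

```python
from collections import defaultdict

def partitions(J, Pi, alpha, Psi):
    """Group index points by Pi*j mod alpha, by Psi*j, and by both."""
    def dot(row, j):
        return sum(r * x for r, x in zip(row, j))

    P_alpha, P_psi, P_both = defaultdict(list), defaultdict(list), defaultdict(list)
    for j in J:
        y0 = dot(Pi, j) % alpha                 # Python's % is non-negative for alpha > 0
        y = tuple(dot(row, j) for row in Psi)   # one entry per separating vector
        P_alpha[y0].append(j)
        P_psi[y].append(j)
        P_both[(y0,) + y].append(j)             # block index of the alpha-Psi partition
    return P_alpha, P_psi, P_both

s = 3
J = [(j1, j2) for j1 in range(s + 1) for j2 in range(s + 1)]
P_alpha, P_psi, P_both = partitions(J, Pi=(-1, 2), alpha=2, Psi=[(1, -1)])
sizes = (len(P_alpha), len(P_psi), len(P_both))
```

With these inputs `sizes` evaluates to `(2, 7, 12)`, in agreement with $|P_\alpha| = 2$, $|P_\Psi| = 7$ and $|P_{\alpha\Psi}| = 12$.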

By Theorem 3.1, if there is at least one point $j \in J$ such that $\Pi j \pmod{\alpha} = i$, then $J_i \in P_\alpha$ is not empty, i.e., $J_i \ne \emptyset$, $i = 0, \ldots, \alpha - 1$. Therefore, $|P_{max}| \ge \alpha$. Intuitively, if $J$ is large enough and dense (informally, an index set $J$ is dense if any arbitrary point $j \in Z^n$ that is inside the boundaries of $J$ belongs to $J$), then for any arbitrary value of $i$, $0 \le i < \alpha$ and $i \in Z$, there usually exists at least one index point $j$ such that $\Pi j \pmod{\alpha} = i$. Therefore, it is reasonable to make the following assumption:

Assumption 3.1 (Index set): For an algorithm $(J, D)$ under consideration in this paper, let $\Pi$ be a partitioning vector and $\alpha$ be the algorithm coefficient. It is assumed that for any arbitrary value of $i \in Z$, $0 \le i < \alpha$, there is at least one point $j \in J$ such that $\Pi j \pmod{\alpha} = i$.

Corollary 3.2: Let $\alpha$ and $P_\alpha$ be the algorithm coefficient and the $\alpha$-partition, respectively. Then $|P_\alpha| = \alpha$ under Assumption 3.1.

The next theorem shows that this is true if the index set $J$ is defined by (2.2), i.e., $J = N^n$. Therefore, $|P_{max}| \ge \alpha$ if $J$ is semi-infinite.

Theorem 3.2: Let $\Pi$ be a partitioning vector of $(J, D)$ where $J$ is defined by (2.2) and $\alpha$ be the algorithm coefficient. Then for any arbitrary value of $i \in Z$, $0 \le i < \alpha$, there exists at least one index point $j \in J$ such that $\Pi j \pmod{\alpha} = i$ and the pseudo-partitionability of $(J, D)$ is greater than or equal to $\alpha$, i.e., $|P_{max}| \ge \alpha$.

Proof:  Provided  in  Appendix. 


IV. Partitioning Vector Approach: Procedure

In this section, Theorem 3.1 and other results and concepts introduced in Section III are used to prescribe a partitioning procedure. Afterwards, Section IV-A discusses how to find the partitioning vectors required by the procedure. Then Section IV-B characterizes algorithms for which the method yields the optimal partition and derives lower and upper bounds on the pseudo-partitionability of arbitrary uniform dependence algorithms. The independent partitioning procedure is as follows:

Procedure 4.1 (Finding the $\alpha\Psi$-partition for algorithm $(J, D)$ by the partitioning vector approach):

Input: Algorithm $(J, D)$.

Output: $\alpha\Psi$-partition $P_{\alpha\Psi}$ for algorithm $(J, D)$.

Step 1: Select $m'$ linearly independent dependence vectors $d_{t_1}, \ldots, d_{t_{m'}}$, set $D_c = [d_{t_1}, \ldots, d_{t_{m'}}]$, find $T \in Z^{m' \times n}$ such that $\mathrm{rank}(TD_c) = m'$ and compute the corresponding partitioning vector $\Pi$ according to Theorem 4.2 provided in Section IV-A. If $\mathrm{disp}\Pi \ne |\det(TD_c)|$, then select another set of $m'$ linearly independent dependence vectors and compute the corresponding partitioning vector until all distinct sets of $m'$ linearly independent dependence vectors are considered. If a partitioning vector $\Pi$ such that $\mathrm{disp}\Pi = |\det(TD_c)|$ is not found, then select the partitioning vector $\Pi$ such that [expression illegible in the source] is




Fig. 3. Partitions of algorithm of Example 3.1 where $D = [2, 2]^T$, $\Pi = [-1, 2]$, and $\Psi = [1, -1]$. (a) $\alpha$-partition: the hyperplanes are described by $\Pi j \pmod{2} = c_1$. Points lying on dotted lines belong to block $J_0 \in P_\alpha$ and points lying on dashed lines belong to block $J_1 \in P_\alpha$. (b) $\Psi$-partition: the hyperplanes are described by $\Psi j = c_2$. Points lying on hyperplane $\Psi j = c_2$ belong to block $J_{[c_2]} \in P_\Psi$. (c) $\alpha\Psi$-partition: dotted and dashed lines specify the $\alpha$-partition and solid lines specify the $\Psi$-partition.


TABLE I

List of Blocks and Their Corresponding Index Points of the $\alpha\Psi$-Partition of Algorithm of Example 3.1


minimum. Then compute the algorithm coefficient $\alpha = \gcd(\Pi d_1, \ldots, \Pi d_m)$.

Step 2: Obtain $n - m'$ linearly independent separating vectors $\Psi_1, \ldots, \Psi_{n-m'}$ by solving equation $\Psi_i D = 0$. Set $\Psi = \begin{bmatrix} \Psi_1 \\ \vdots \\ \Psi_{n-m'} \end{bmatrix}$.

Step 3: For every $j \in J$, if $\begin{bmatrix} \Pi j \pmod{\alpha} \\ \Psi j \end{bmatrix} = \hat{y}_i$, then assign $j$ to $J_{\hat{y}_i}$, the block indexed by $\hat{y}_i$, i.e., $j \in J_{\hat{y}_i}$.

Step 4: $P_{\alpha\Psi} = \{J_{\hat{y}_1}, \ldots, J_{\hat{y}_\nu}\}$; Stop. □

A. Finding a Partitioning Vector

This subsection provides in Theorem 4.2 a closed form expression for the computation of partitioning vector $\Pi$, as required in Step 1 of the partitioning Procedure 4.1. In addition, an equal partitioning vector is preferred whenever it exists. This is because of the simple and regular mappings that result from equal partitioning vectors when time scheduling of computations is considered [21]. So, necessary and sufficient conditions are provided in Theorem 4.1 and Corollary 4.1 for the existence of this type of vector for a given algorithm.

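Theorem 4.1: An algorithm $(J, D)$ has an equal partitioning vector if and only if there exists a set of $m'$ linearly independent vectors $d_{t_1}, d_{t_2}, \ldots, d_{t_{m'}}$ such that

$$D = [d_{t_1}, d_{t_2}, \ldots, d_{t_{m'}}] \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ a_{21} & a_{22} & \cdots & a_{2m} \\ \vdots & \vdots & & \vdots \\ a_{m'1} & a_{m'2} & \cdots & a_{m'm} \end{bmatrix}, \quad a_{ij} \in R \qquad (4.1)$$

where $\sum_{i=1}^{m'} a_{ij}$ is an integer, $j = 1, \ldots, m$.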

Proof:  Provided  in  Appendix. 

It  is  not  easy  to  test  whether  a  given  algorithm  has  an  equal 
partitioning  vector  using  the  condition  in  Theorem  4.1.  The 
following  corollary  provides  sufficient  conditions  which  are 
easier  to  test. 

Corollary 4.1: An algorithm $(J, D)$ has an equal partitioning vector if it satisfies one of the following conditions:

1) $\mathrm{rank}([d_1 - d_2, d_1 - d_3, \ldots, d_1 - d_m]) < \mathrm{rank}(D)$.

2) There exists a set of $m'$ linearly independent dependence vectors $d_{t_1}, \ldots, d_{t_{m'}}$ such that all dependence vectors can be expressed as an integer linear combination of $d_{t_1}, \ldots, d_{t_{m'}}$, i.e., $d_j = \sum_{i=1}^{m'} a_{ij} d_{t_i}$, $j = 1, \ldots, m$, where $a_{ij}$, $i = 1, \ldots, m'$ and $j = 1, \ldots, m$, are integer constants.

Proof:  Provided  in  Appendix. 

If algorithm $(J, D)$ satisfies condition 1 in Corollary 4.1, then it has an equal partitioning vector $\Pi$ such that $\Pi d_1 = \cdots = \Pi d_m = \mathrm{disp}\Pi$. To see if a given algorithm satisfies condition 2 in Corollary 4.1, one has to see if (4.1) has an integer solution. This can be achieved by applying the necessary and sufficient conditions for a linear system of equations to have an integer solution provided in [19], [21].

Given $m'$ linearly independent vectors $d_{t_1}, \ldots, d_{t_{m'}}$, the corresponding partitioning vector $\Pi$ belongs to the solution space of (3.2). In [3], a closed form expression for a partitioning vector which is determined by $d_{t_1}, \ldots, d_{t_{m'}}$ is given. This result is restated as Theorem 4.2 as follows.

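Theorem 4.2 [3]: Let $d_{t_1}, \ldots, d_{t_{m'}}$ be linearly independent, consider matrix $D_c = [d_{t_1}, \ldots, d_{t_{m'}}]$ and let $T \in Z^{m' \times n}$ be such that $\mathrm{rank}(TD_c) = m'$. Then $\Pi = \beta \vec{1} (TD_c)^{-1} T$, where $\vec{1}$ is the all-ones row vector, is a partitioning vector determined by $d_{t_1}, \ldots, d_{t_{m'}}$ and $\mathrm{disp}\Pi = \beta$, where $\beta \in N^+$ is such that $\Pi \in Z^{1 \times n}$ and the greatest common divisor of the $n$ components of $\Pi$ is equal to one.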

Notice that a matrix $T \in Z^{m' \times n}$ such that $\mathrm{rank}(TD_c) = m'$ always exists. Because $\mathrm{rank}(D_c) = m'$, there are $m'$ linearly independent rows in $D_c$. Suppose rows $r_1, \ldots, r_{m'}$ are linearly independent. If $T = [E_{r_1}, \ldots, E_{r_{m'}}]^T$, where $E_{r_1}, \ldots, E_{r_{m'}}$ are as defined in the beginning of Section II, then $\mathrm{rank}(TD_c) = m'$. In other words, the result of multiplying $D_c$ by $T$ is a square submatrix of $D$ that contains exactly $m'$ linearly independent rows of the $m'$ linearly independent columns of $D$. If $m' = n$, then $T = I$, the identity matrix, and $\Pi = \beta \vec{1} D_c^{-1}$. The essence of the proof is as follows [3]. Because $\beta \vec{1} (TD_c)^{-1} T D_c = \beta \vec{1}$, vector $\beta \vec{1} (TD_c)^{-1} T$ satisfies (3.1) and meets conditions 1 and 2 in Definition 3.1 by the meaning of the constant $\beta$; so $\Pi = \beta \vec{1} (TD_c)^{-1} T$ is a partitioning vector determined by $d_{t_1}, \ldots, d_{t_{m'}}$ and $\mathrm{disp}\Pi = \beta > 0$.
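
The closed form of Theorem 4.2 can be evaluated exactly with rational arithmetic. The sketch below is an illustration under stated assumptions: the helper names `solve_square` and `partitioning_vector` are hypothetical, $D_c$ is passed as a list of $n$ rows, $T$ as a list of $m'$ rows, and $TD_c$ is assumed nonsingular. It first solves $x (TD_c) = \vec{1}$, forms $xT$, and then scales the result to the primitive integer vector $\Pi$, which also yields $\beta = \mathrm{disp}\Pi$.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def solve_square(A, b):
    """Solve A x = b exactly for a nonsingular square integer matrix A."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(bi)] for row, bi in zip(A, b)]
    for col in range(n):
        pr = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pr] = M[pr], M[col]
        piv = M[col][col]
        M[col] = [v / piv for v in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [row[n] for row in M]

def partitioning_vector(Dc, T):
    """Return (Pi, beta) with Pi = beta * 1 * (T Dc)^-1 * T and gcd(Pi) = 1."""
    m, n = len(T), len(T[0])
    TDc = [[sum(T[i][k] * Dc[k][j] for k in range(n)) for j in range(m)]
           for i in range(m)]
    # x = 1 (T Dc)^-1 is equivalent to (T Dc)^T x^T = 1^T
    TDc_t = [[TDc[r][c] for r in range(m)] for c in range(m)]
    x = solve_square(TDc_t, [1] * m)
    v = [sum(x[i] * T[i][k] for i in range(m)) for k in range(n)]
    lcm = reduce(lambda a, b: a * b // gcd(a, b), (f.denominator for f in v), 1)
    ints = [int(f * lcm) for f in v]
    g = reduce(gcd, (abs(i) for i in ints))
    return [i // g for i in ints], lcm // g

pi_a, beta_a = partitioning_vector([[2], [2]], [[-1, 2]])               # D = [2, 2]^T
pi_b, beta_b = partitioning_vector([[2, -3], [-1, 2]], [[1, 0], [0, 1]])  # det D = 1
```

For $d = [2, 2]^T$ with $T = [-1, 2]$ this gives $\Pi = [-1, 2]$ and $\beta = 2$. For the unimodular matrix with rows $[2, -3]$ and $[-1, 2]$ it gives $\Pi = [3, 5]$ and $\beta = 1$, so a single block results.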

B.  Sufficient  Conditions  for  Optimality 

Theorem 3.1 provides a necessary condition for two index points in $J$ to be pseudo-connected. Next it is shown in Theorems 4.3 and 4.3a that when the dependence matrix $D$ satisfies certain constraints, this condition becomes sufficient. The implication of this result is that the partition obtained by Procedure 4.1 is maximal. In order to motivate and facilitate the understanding of the main results of this section, first, a special case is discussed in Theorem 4.3 where $m' = n$, i.e., $\mathrm{rank}(D) = n$. In this case, the $\Psi$-partition is trivial, i.e., $P_\Psi = \{J\}$. Therefore, by Theorem 3.1, the necessary condition for two index points $j_1, j_2 \in J$ to be pseudo-connected is $\Pi j_1 \pmod{\alpha} = \Pi j_2 \pmod{\alpha}$.

Theorem 4.3: Let $m' = n$, $\Pi$ be a partitioning vector of algorithm $(J, D)$ determined by $d_{t_1}, \ldots, d_{t_n}$, $D_c = [d_{t_1}, \ldots, d_{t_n}]$ and $\alpha$ be the algorithm coefficient. If $|\det D_c| = \mathrm{disp}\Pi$ then

1) two index points $j_1, j_2 \in J$ are pseudo-connected if and only if $\Pi j_1 \pmod{\alpha} = \Pi j_2 \pmod{\alpha}$;

2) the $\alpha$-partition is the maximal pseudo-independent partition of $(J, D)$, i.e., $P_{max} = P_\alpha$, and $|P_{max}| = \alpha$.

Proof:  Provided  in  Appendix. 

In this case, Procedure 4.1 becomes very simple. Since $\mathrm{rank}(D) = n$, there is only one trivial separating vector $0$ and therefore $P_\Psi = \{J\}$. So Step 3 in Procedure 4.1 can be skipped. When $\Pi$ is an equal partitioning vector, then $\Pi d_i \pmod{\mathrm{disp}\Pi} = 0$, $i = 1, \ldots, m$. So $\alpha = \mathrm{disp}\Pi = |\det D_c|$. This fact is summarized as Corollary 4.2 as follows.

Corollary 4.2: Let $m' = n$, $\Pi$ be an equal partitioning vector of algorithm $(J, D)$ determined by $d_{t_1}, \ldots, d_{t_n}$ and $D_c = [d_{t_1}, \ldots, d_{t_n}]$. If $|\det D_c| = \mathrm{disp}\Pi$ then the pseudo-partitionability of $(J, D)$ is equal to the absolute value of the determinant of matrix $D_c$, i.e., $|P_{max}| = |\det D_c|$.

The meaning of Corollary 4.2 is as follows. For a class of algorithms, the number of blocks in the maximal pseudo-independent partition is equal to $|\det D_c|$, the absolute value of the determinant of a submatrix of the dependence matrix $D$. If the algorithm is to be executed by clusters of processors with limited intercluster communication capabilities, then the number of clusters to be used should be directly related and perhaps equal to the cardinality of the pseudo-independent partition. In such MIMD systems, $|\det D_c|$ is a direct indication of how many clusters can be used to execute the algorithm.
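
As a worked check of Corollary 4.2, consider a two-dimensional algorithm whose dependence matrix is assumed to have rows $[3, 0]$ and $[-3, 2]$ (the matrix read from Example 2.1) and whose index set satisfies Assumption 3.1. The computation below applies Theorem 4.2 with $T = I$; it is an illustration, not a result stated in the paper.

```latex
\[
D_c = D = \begin{bmatrix} 3 & 0 \\ -3 & 2 \end{bmatrix}, \qquad
\det D_c = 6, \qquad
D_c^{-1} = \frac{1}{6}\begin{bmatrix} 2 & 0 \\ 3 & 3 \end{bmatrix},
\]
\[
\vec{1}\, D_c^{-1} = \tfrac{1}{6}\,[\,5,\ 3\,]
\;\Longrightarrow\;
\Pi = [\,5,\ 3\,], \quad \beta = \mathrm{disp}\,\Pi = 6,
\]
\[
\Pi d_1 = 5\cdot 3 + 3\cdot(-3) = 6, \qquad
\Pi d_2 = 5\cdot 0 + 3\cdot 2 = 6, \qquad
\alpha = \gcd(6, 6) = 6 .
\]
```

Since $m = m' = 2$, both dependence vectors are determining vectors, so $\Pi$ is an equal partitioning vector, and $|\det D_c| = \mathrm{disp}\Pi = 6$. Under these assumptions the corollary gives a pseudo-partitionability of 6, i.e., up to six clusters could work on independent blocks.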

To find the necessary and sufficient conditions for two points $j_1, j_2 \in J$ to be pseudo-connected in the general case, the approach used here is as follows. First, a subalgorithm $(J_y, D)$ where $J_y \in P_\Psi$ is considered and the necessary and sufficient conditions for two points $j_1, j_2 \in J_y$ to be pseudo-connected are derived. To achieve this, the algorithm $(J_y, D)$ is transformed, by a linear mapping $T$, into another algorithm $(T(J_y), T(D))$ where the dimension of the index points is $m'$ and there are $m'$ linearly independent dependence vectors. Then Theorem 4.3 is applied to find these necessary and sufficient conditions for algorithm $(T(J_y), T(D))$. Then it is shown that the mapping $T$ is bijective and algorithms $(J_y, D)$ and $(T(J_y), T(D))$ are equivalent in the sense that $j_1, j_2 \in J_y$ are pseudo-connected in algorithm $(J_y, D)$ if and only if $T(j_1), T(j_2)$ are pseudo-connected in algorithm $(T(J_y), T(D))$. So these necessary and sufficient conditions for algorithm $(T(J_y), T(D))$ are actually valid for algorithm $(J_y, D)$.

Theorem 4.3a: Consider algorithm $(J, D)$, let $d_{t_1}, \ldots, d_{t_{m'}}$ be linearly independent, $D_c = [d_{t_1}, \ldots, d_{t_{m'}}]$, $T \in Z^{m' \times n}$ be such that $\mathrm{rank}(TD_c) = m'$, $\Pi = \mathrm{disp}\Pi \cdot \vec{1} (TD_c)^{-1} T$ be the partitioning vector determined by $d_{t_1}, \ldots, d_{t_{m'}}$, $\alpha$ be the algorithm coefficient and $\Psi$ be a separating matrix. If $|\det(TD_c)| = \mathrm{disp}\Pi$, then

1) two points $j_1, j_2 \in J$ are pseudo-connected if and only if $\Pi(j_1 - j_2) \pmod{\alpha} = 0$ and $\Psi j_1 = \Psi j_2$;

2) the $\alpha\Psi$-partition is the maximal pseudo-independent partition of $(J, D)$, i.e., $P_{max} = P_{\alpha\Psi}$;

3) $|P_\Psi| \le \prod_{i=1}^{n-m'} (\gamma_i + 1)$, where $\gamma_i = \max\{\Psi_i(j_1 - j_2) : j_1, j_2 \in J\}$, $i = 1, \ldots, n - m'$, and $\alpha \le |P_{max}| \le \alpha|P_\Psi|$.

Proof:  Provided  in  Appendix. 

If the cardinalities of the $\alpha$-partitions of algorithms $(J_{y_i}, D)$, where $J_{y_i} \in P_\Psi$, $i = 1, \ldots, \eta$, are all equal to $\alpha$, then $|P_{max}| = \alpha|P_\Psi|$. However, for some block $J_y \in P_\Psi$, the cardinality of its $\alpha$-partition might be less than $\alpha$ because for some value of $i \in Z$, $0 \le i < \alpha$, there might not exist any index point $j \in J_y$ such that $\Pi j \pmod{\alpha} = i$. This phenomenon is illustrated in the following example.

Example 4.1: Consider the algorithm of Example 3.1 with
$s = 3$. There is only one set of determining vectors $\{\bar d\}$ and
$D_c = D$. If $T = [-1, 2]$, then $TD_c = [2]$. According to
Theorem 4.2, $\Pi = 2 \cdot \bar 1 \cdot [2]^{-1}[-1, 2] = [-1, 2]$ and $\mathrm{disp}\,\Pi =
2 = \det(TD_c)$. As in Example 3.1, the separating matrix
$\Psi = [1, -1]$. To illustrate Theorem 4.3a 1), consider points
$\bar j_1 = [0, 0]^T$ and $\bar j_2 = [2, 2]^T$. Because $\Pi \bar j_1 \pmod{\alpha} =
\Pi \bar j_2 \pmod{\alpha}$ and $\Psi \bar j_1 = \Psi \bar j_2$, by Theorem 4.3a, they are
pseudo-connected. Due to the fact that $\mathrm{disp}\,\Pi = \det(TD_c)$,
by Theorem 4.3a, $P_{\alpha\Psi}$ is the maximal pseudo-independent
partition. Consider $J_{[3]} \in P_\Psi$, i.e., the block whose points $\bar j$
are such that $\Psi \bar j = 3$. $J_{[3]} = \{[3, 0]^T\}$ as found in Example
3.1. There does not exist any index point $\bar j \in J_{[3]}$ such that
$\Pi \bar j \pmod{\alpha} = 0$. This illustrates the explanation before this
example. By Theorem 4.3a 3), $|P_\Psi| \le \gamma + 1 = 7$, where
$\gamma = 3 - (-3) = 6$ and $|P_{\max}| = 12 \le \alpha|P_\Psi| \le 14$. End
of example.
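The block counts quoted in Example 4.1 can be reproduced with a short enumeration. The sketch below is illustrative only: the index set of Example 3.1 is not restated in this section, so it is assumed here to be $\{0, \ldots, 3\}^2$ (consistent with $\gamma = 3 - (-3) = 6$), and the names `block_index` and `alpha_psi_partition` are hypothetical helpers, not part of the paper.

```python
from itertools import product

# Assumed data: index set {0,...,3}^2 (an assumption, see lead-in).
PI = (-1, 2)    # partitioning vector from Example 4.1
PSI = (1, -1)   # separating matrix (a single row here)
ALPHA = 2       # algorithm coefficient, equal to disp(PI) in this example

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def block_index(j):
    """Index of the alpha-Psi block containing j: (Pi*j mod alpha, Psi*j)."""
    return (dot(PI, j) % ALPHA, dot(PSI, j))

def alpha_psi_partition(points):
    """Group index points by their alpha-Psi block index."""
    blocks = {}
    for j in points:
        blocks.setdefault(block_index(j), []).append(j)
    return blocks

index_set = list(product(range(4), repeat=2))
blocks = alpha_psi_partition(index_set)

print(len(blocks))                                  # number of alpha-Psi blocks
print(block_index((0, 0)) == block_index((2, 2)))   # same block, as in the example
print(len({psi for _, psi in blocks}))              # number of Psi-blocks
```

Under the stated assumption the enumeration yields 12 blocks over 7 distinct values of $\Psi \bar j$, and the block with $\Psi \bar j = 3$ holds only the residue 1, which is the empty-residue phenomenon the example describes.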

The complexity of Procedure 4.1 is a linear function of the
cardinality of the index set. Step 1 computes the partitioning
vector. The complexity of the product of the number of
memory locations (space) and the number of the operations
(time) is bounded by $O(\binom{m}{m'} n^3)$. For Step 2 the space × time
complexity is bounded by $O((n - m')n^5)$. Step 3 needs at
most $O(\min\{n, n - m' + 1\}\, n|J|)$ operations. The total complexity
is bounded above by $O(\binom{m}{m'} n^3) + O((n - m')n^5) +
O(\min\{n, n - m' + 1\}\, n|J|)$. When $m = m' = n$, the
complexity is bounded above by $O(n^5) + O(n|J|)$.


V. Smith Normal Form Approach


The partitioning vector approach is simple and the partitioning
vector and separating vectors are easy to compute.
However, when the algorithm does not satisfy the condition
in Theorem 4.3a, the optimality is not always guaranteed.
This section discusses the Smith normal form approach which
yields maximal pseudo-independent partitions for arbitrary
uniform dependence algorithms. This approach uses the Smith
normal form of the dependence matrix $D$ which is introduced
in Theorem 5.1. This theorem is followed by the definitions
of the partitioning matrix and the displacement vector. These
concepts are then used to define a partition of index set $J$
which is also the maximal pseudo-independent partition of the
algorithm. Then a procedure is presented which constructs the
maximal pseudo-independent partition of a given algorithm.
Complexity of the procedure is also discussed.

Theorem 5.1 (Smith normal form) [19, p. 50]: Given a matrix
$D \in Z^{n \times m}$, there exist two unimodular³ matrices $U \in Z^{n \times n}$
and $V \in Z^{m \times m}$ such that

$$UDV = S = \begin{bmatrix}
s_1 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
0 & s_2 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & s_{m'} & 0 & \cdots & 0 \\
0 & 0 & \cdots & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & & \vdots & \vdots & & \vdots \\
0 & 0 & \cdots & 0 & 0 & \cdots & 0
\end{bmatrix}.$$

Matrix $S$ is called the Smith normal form (abbreviated SNF)
of matrix $D$, it is unique, $s_1, \ldots, s_{m'}$ are positive integers,
$s_1 \mid s_2 \mid \cdots \mid s_{m'}$, $\prod_{i=1}^{k} s_i$, $k = 1, \ldots, m'$, is the greatest common
divisor of subdeterminants of order $k$ of the dependence
matrix $D$ and $m'$ is the rank of the dependence matrix $D$. □

More details and explanations about SNF can be found in
[19, p. 50] and [24]. The following example illustrates the
concepts just introduced.

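Example 5.1: Consider the algorithm $(J, D)$ studied in [15]
for which

$$D = \begin{bmatrix} 1 & 2 & 0 \\ -2 & 4 & 4 \\ 4 & -1 & 2 \end{bmatrix}$$

and $J = \{[j_1, j_2, j_3]^T : 1 \le j_i \le 16,\ j_i \in N^+,\ i = 1, 2, 3\}$.
The matrices $U$, $S$, and $V$ are as follows.

$$U = \begin{bmatrix} 1 & 0 & 0 \\ 2 & -1 & -1 \\ -14 & 9 & 8 \end{bmatrix}, \quad
S = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 52 \end{bmatrix}, \quad
V = \begin{bmatrix} 1 & -2 & -12 \\ 0 & 1 & 6 \\ 0 & 0 & 1 \end{bmatrix}.$$

It is easy to verify that $UDV = S$ and $U$ and $V$ are
unimodular. End of example.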
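The verification mentioned at the end of Example 5.1 can be carried out mechanically. The following is a minimal sketch using plain Python lists. The entries of $U$ and $V$ are the ones read from the example, whose matrix layout is damaged in the scan, so their values should be treated as an assumption that the check itself confirms. The helpers `matmul` and `det3` are illustrative names, not part of the paper.

```python
def matmul(A, B):
    """Product of two integer matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

D = [[1, 2, 0], [-2, 4, 4], [4, -1, 2]]        # dependence matrix of Example 5.1
U = [[1, 0, 0], [2, -1, -1], [-14, 9, 8]]      # left multiplier (as read from the example)
V = [[1, -2, -12], [0, 1, 6], [0, 0, 1]]       # right multiplier (as read from the example)

S = matmul(matmul(U, D), V)
print(S)                            # expected to be diagonal with entries 1, 1, 52
print(det3(U), det3(V), det3(D))    # unimodular multipliers have determinant +1 or -1
```

The diagonal of $S$ gives the nonzero part of the displacement vector, and $|\det D|$ equals the product of the diagonal entries, as Theorem 5.1 requires when $m = m' = n$.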

³ A nonsingular matrix is unimodular if its elements are integral and its
determinant is ±1.


SHANG AND FORTES: INDEPENDENT PARTITIONING OF ALGORITHMS WITH UNIFORM DEPENDENCIES


Definition 5.1 (Partitioning matrix and displacement vector):
Given an algorithm $(J, D)$, the matrix $U$ such that
$UDV = S$ is the SNF of $D$ is called partitioning matrix
of $(J, D)$. Given that $s_1, \ldots, s_{m'}$ are the nonzero diagonal
elements of $S$, the vector $\bar s = [s_1, \ldots, s_{m'}, \infty, \ldots, \infty]^T \in N^n$
is called displacement vector of $(J, D)$.

Definition 5.2 ($U$-partition): Let $U$ be a partitioning matrix
of algorithm $(J, D)$; the partition $P_U = \{J_{\bar y_1}, \ldots, J_{\bar y_p}\}$ of
index set $J$ is called the $U$-partition of algorithm $(J, D)$ if
$J_{\bar y_i} = \{\bar j : U\bar j \pmod{\bar s} = \bar y_i,\ \bar j \in J\}$,⁴ where $\bar y_i \in Z^n$ is called
the index of block $J_{\bar y_i}$, $i = 1, \ldots, p$.

Example 5.2: Consider the algorithm of Example 5.1. $U$ is a
partitioning matrix; $\bar s = [1, 1, 52]^T$ is the displacement vector;
$P_U = \{J_{[0,0,0]^T}, \ldots, J_{[0,0,51]^T}\}$ is the $U$-partition where
$J_{[0,0,i]^T} = \{\bar j : U\bar j \pmod{\bar s} = [0, 0, i]^T,\ \bar j \in J\}$, $i = 0, \ldots, 51$.
End of example.

It is clear that $P_U$ is a partition of the index set $J$ because
for each $\bar j \in J$, $U\bar j \pmod{\bar s}$ is unique. Actually, $P_U$ is a
pseudo-independent partition. To show the pseudo-independence of the
$U$-partition, the following lemma is introduced first and followed
by a theorem.

Lemma 5.1: Given algorithm $(J, D)$, let $\bar j_1, \bar j_2 \in J$ and $\bar s$ be
the displacement vector, then $\bar j_1$ and $\bar j_2$ are pseudo-connected
if and only if $U\bar j_1 \pmod{\bar s} = U\bar j_2 \pmod{\bar s}$.

Proof:  Provided  in  Appendix. 

Theorem 5.2: Given an algorithm $(J, D)$, let $U_i$ be the
$i$th row of the partitioning matrix $U$, $i = 1, \ldots, n$, $\delta_{iu} =
\max\{U_i \bar j : \bar j \in J\}$, $\delta_{il} = \min\{U_i \bar j : \bar j \in J\}$ and $\gamma_i =
\delta_{iu} - \delta_{il} + 1$, $i = m' + 1, \ldots, n$. The following statements
are true:

1) The $U$-partition is the maximal pseudo-independent partition,
i.e., $P_U = P_{\max}$.

2) The pseudo-partitionability is bounded above by
$\prod_{i=1}^{m'} s_i \prod_{i=m'+1}^{n} \gamma_i$, i.e., $|P_{\max}| = |P_U| \le \prod_{i=1}^{m'} s_i
\prod_{i=m'+1}^{n} \gamma_i$.

Proof:  Provided  in  Appendix. 

For every vector $\bar y = [y_1, \ldots, y_n]^T$, $0 \le y_i < s_i$, $i =
1, \ldots, m'$, $\delta_{kl} \le y_k \le \delta_{ku}$, $k = m' + 1, \ldots, n$, if there exists
at least one index point $\bar j \in J$ such that $\bar j \in J_{\bar y}$, then $J_{\bar y} \ne \emptyset$
and $|P_U| = \prod_{i=1}^{m'} s_i \prod_{i=m'+1}^{n} \gamma_i$. Because $\det U = \pm 1$, for
each such vector $\bar y$, there always exists an integer vector
$\bar j$ such that $U\bar j \pmod{\bar s} = \bar y$. Therefore, it is reasonable to
assume that $|P_U| = \prod_{i=1}^{m'} s_i \prod_{i=m'+1}^{n} \gamma_i$. In particular, for
algorithms whose index set $J = Z^n$, this assumption is true.
For the special case where $m = m' = n$, by Theorem 5.1,
$\prod_{i=1}^{n} s_i = |\det D|$. If for each vector $\bar y = [y_1, \ldots, y_n]^T$, $0 \le
y_i < s_i$, $i = 1, \ldots, n$, there is at least one index point $\bar j \in J$
such that $\bar j \in J_{\bar y}$, then the following corollary is true.

Corollary 5.1: Given algorithm $(J, D)$, let $m = m' = n$
and $\bar s = [s_1, \ldots, s_n]^T$ be the displacement vector. Then
$|P_{\max}| = \prod_{i=1}^{n} s_i = |\det D|$, i.e., the pseudo-partitionability
of algorithm $(J, D)$ is equal to the absolute value of the
determinant of dependence matrix $D$.
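As a worked check of Corollary 5.1 (added for illustration), take the dependence matrix of Example 5.1, for which $m = m' = n = 3$ and the displacement vector is $[1, 1, 52]^T$. Expanding the determinant along the first row gives

$$|\det D| = \bigl|\,1\,(4 \cdot 2 - 4 \cdot (-1)) - 2\,((-2) \cdot 2 - 4 \cdot 4) + 0\,\bigr| = |12 + 40| = 52 = 1 \cdot 1 \cdot 52 = \prod_{i=1}^{3} s_i .$$

Whether all 52 blocks are actually non-empty still depends on the index set, as the paragraph preceding the corollary points out.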

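⁴ Let $\bar\lambda = [\lambda_1, \ldots, \lambda_n]^T \in Z^n$ and $\bar s = [s_1, \ldots, s_{m'}, \infty, \ldots, \infty]^T$
with $s_i > 0$, $i = 1, \ldots, m'$. The notation $\bar\lambda \pmod{\bar s}$ denotes the vector
$[\lambda_1 \pmod{s_1}, \ldots, \lambda_{m'} \pmod{s_{m'}}, \lambda_{m'+1}, \ldots, \lambda_n]^T$.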


Procedure 5.1 (Finding the maximal pseudo-independent
partition by SNF approach):

Input: Algorithm $(J, D)$.

Output: $U$-partition $P_U$ of algorithm $(J, D)$.

Step 1: Find a partitioning matrix $U$ and the displacement
vector $\bar s$.

Step 2: For every index point $\bar j \in J$, compute $U\bar j \pmod{\bar s} = \bar y$
and assign $\bar j$ to $J_{\bar y}$, the block indexed by $\bar y$.

Step 3: $P_U = \{J_{\bar y_1}, \ldots, J_{\bar y_p}\}$. Stop. □
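Steps 2 and 3 of Procedure 5.1 amount to hashing each index point by $U\bar j \pmod{\bar s}$. The sketch below illustrates this on the data of Examples 5.1 and 5.2. Step 1 (computing the Smith normal form) is assumed to have been done separately, the infinite entries of the displacement vector are represented by `None`, and the function names `mod_s` and `u_partition` are hypothetical.

```python
from itertools import product

INF = None  # stands for an infinite entry of the displacement vector

def mod_s(vec, s):
    """Componentwise vec (mod s); entries whose modulus is INF are left unchanged."""
    return tuple(v if m is INF else v % m for v, m in zip(vec, s))

def u_partition(index_set, U, s):
    """Steps 2 and 3: group index points j by the block index y = U j (mod s)."""
    blocks = {}
    for j in index_set:
        Uj = [sum(u * x for u, x in zip(row, j)) for row in U]
        blocks.setdefault(mod_s(Uj, s), []).append(j)
    return blocks

# Step 1 is assumed done: U and s taken from Examples 5.1 and 5.2.
U = [[1, 0, 0], [2, -1, -1], [-14, 9, 8]]
s = (1, 1, 52)
J = product(range(1, 17), repeat=3)   # index set {1,...,16}^3

P_U = u_partition(J, U, s)
print(len(P_U))   # number of blocks in the U-partition
```

Because the work per index point is one matrix-vector product, the loop costs $O(|J| n^2)$ operations, which matches the bound stated for Step 2 in the complexity discussion.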

The complexity of Procedure 5.1 is a linear function of the
cardinality of the index set $J$, i.e., the number of computations
of the algorithm. Also, it is a polynomial function of $n$
and $m$, the number of components of index vectors and the
number of the dependence vectors, respectively. To construct
the displacement vector $\bar s$, the SNF of matrix $D$ is needed. In
[6], a polynomial algorithm is proposed to find the SNF of
any arbitrary matrix $A \in Z^{n \times m}$ and the corresponding left
and right multipliers $U$ and $V$ such that $UAV = \mathrm{SNF}$. The
complexity of the product of the number of memory locations
(space) and the number of operations (time) of this algorithm
is $O((\max\{n, m\})^{10})$ [6]. So, the space × time complexity of
Step 1 is bounded above by $O((\max\{n, m\})^{10})$. For Step 2,
at most $O(|J| n^2)$ operations are needed to compute the value
of $U\bar j$ for every index point $\bar j$. Therefore, the total complexity
is bounded above by $O((\max\{n, m\})^{10}) + O(|J| n^2)$. When
$m = m' = n$, the complexity is bounded above by $O(n^{10}) +
O(n^2 |J|)$.

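Example 5.3: Consider the matrix multiplication algorithm

$$c_{i,j} = 0$$
$$c_{i,j} = c_{i,j} + a_{i,k} \cdot b_{k,j}, \quad k = 1, \ldots, s$$
$$i, j = 1, \ldots, s.$$

The dependence matrix and the index set are as follows:

$$D = [0, 0, 1]^T, \quad J = \{[i, j, k]^T : 1 \le i, j, k \le s,\ [i, j, k]^T \in Z^3\}. \tag{5.1}$$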

Notice that the dependence matrix above only considers the
data dependence caused by $c$. If data broadcasting is not
allowed or the dependence caused by data reads is considered,
then the dependence matrix for the matrix multiplication
algorithm is

$$D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

[11]. Dependences $[1, 0, 0]^T$ and $[0, 1, 0]^T$ are to avoid broadcasting
data $b$ and $a$, respectively.

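For the dependence matrix $D$ in (5.1), matrices $U$, $V$, and
$S$ and the displacement vector $\bar s$ are as follows:

$$U = \begin{bmatrix} 0 & 0 & 1 \\ 0 & 1 & 0 \\ 1 & 0 & 0 \end{bmatrix}, \quad
S = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad
V = [1], \quad
\bar s = [1, \infty, \infty]^T.$$

If $s = 3$, according to Definition 5.2, the $U$-partition
is $P_U = \{J_{[0,j,i]^T} : i, j = 1, 2, 3\}$ where $J_{[0,j,i]^T} =$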




$\{[i, j, 1]^T, [i, j, 2]^T, [i, j, 3]^T\}$. There are nine blocks in this
maximal partition ($|P_{\max}| = 9$). Intuitively, each block
computes datum $c_{i,j}$. If multiple copies of data $a_{i,j}$ and $b_{i,j}$
can be provided, then there is no communication/dependence
among different blocks. Therefore, this partition is independent
and also maximal. End of example.
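The nine blocks of Example 5.3 can be enumerated directly from $U$ and the displacement vector. This is a small sketch under the example's own data ($s = 3$). The variable names and the use of `None` for the infinite entries of the displacement vector are illustrative choices, not notation from the paper.

```python
from itertools import product

U = [[0, 0, 1], [0, 1, 0], [1, 0, 0]]   # partitioning matrix of Example 5.3
S_BAR = (1, None, None)                 # displacement vector; None marks infinity

def block_of(point):
    """Block index U*point (mod S_BAR); components with modulus None are kept as is."""
    Up = [sum(u * x for u, x in zip(row, point)) for row in U]
    return tuple(v if m is None else v % m for v, m in zip(Up, S_BAR))

size = 3
blocks = {}
for point in product(range(1, size + 1), repeat=3):   # point = (i, j, k)
    blocks.setdefault(block_of(point), []).append(point)

print(len(blocks))         # number of blocks in the U-partition
print(blocks[(0, 2, 1)])   # the block that accumulates c_{1,2}
```

Each block collects the three index points that share $(i, j)$ and differ only in $k$, that is, the serial accumulation chain of one entry of the product matrix.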

VI.  Comparative  Evaluation 
of  Partitioning  Procedures 

This  section  starts  by  discussing  some  of  the  advantages  of 
the  methods  proposed  in  this  paper  over  the  earlier  pioneering 
method  reported  in  [15]  and  [17].  In  Section  VI-B  the  two 
approaches  proposed  in  this  paper  are  compared. 

A.  Minimum  Distance  Method 

In the minimum distance method (abbreviated MDM) [15],
[17], an elegant idea is used which consists of using a linear
mapping to transform the dependence matrix $D$ into an upper
triangular matrix denoted $D^t$ in [15]. These two dependence
matrices are equivalent in the sense that each dependence
vector in $D^t$, the upper triangular matrix, is a linear integer
combination of the dependence vectors in $D$ and vice versa.
A set of initial points, each of which corresponds to a block in
the resulting partition, is identified by $D^t$ and the cardinality
of the partition is the product of the diagonal elements of $D^t$.
An independent partition is implicitly expressed by $D^t$ and
a set of initial points. "Independent partition" in [15] is used
to denote the same as "pseudo-independent partition" in this
paper.

The MDM finds the maximal pseudo-independent partition
for a class of algorithms which is more restricted than the
methods proposed in this paper. For the case where $m' =
m = n$, the MDM generates the maximal pseudo-independent
partition and for the case where $m' = m < n$, it generates
an independent partition that may not be maximal. In [17],
an algorithm to generate initial points is presented for this
case. However, its complexity and optimality are not clear.
Moreover, only index sets of the form $J = \{[j_1, \ldots, j_n]^T :
a_i \le j_i \le b_i,\ i = 1, \ldots, n\}$ (not necessarily dense) are
considered; otherwise, the initial points are not easy to identify.

Compared with the partitioning vector approach (abbreviated
PVA) proposed in this paper, the MDM has the following
disadvantages. First, in the MDM, partitions are expressed
implicitly in terms of the upper triangular matrix and a set of
initial points. According to [15], to find the upper triangular
matrix, it is necessary to solve $n$ integer programming problems
with $m$ variables which are NP-complete, where $n$, $m$
are the number of dimensions of the index points and the
number of dependence vectors, respectively. This is expensive
although it is affordable when $n$, $m$ are small. In the PVA,
partitions are expressed explicitly in terms of the partitioning
vectors and separating vectors. To obtain these vectors, the
dominating computations required are to find partitioning
vectors, i.e., consider at most all possible combinations of
$m'$ vectors from the $m$ dependence vectors and compute
$\mathrm{disp}\,\Pi\;\bar 1\,(TD_c)^{-1}T$. The complexity of the execution time of
Procedure 4.1 is bounded above by $O(\binom{m}{m'} n^3)$.


Second, as mentioned above, in the MDM, blocks of the
resulting partition are implicitly expressed in terms of the
upper triangular matrix and a set of initial points. Although
the serial loops in the original program can be transformed
into parallel loops by the upper triangular matrix, it is costly
to know which block a given index point belongs to. If the
notation in [15] is used here, given an index point $X \in Z^{1 \times n}$,
then one way to see which block it belongs to is to see if
equation $X = X_{i0} + \lambda D^t$ has an integer solution $\lambda \in Z^{1 \times n}$,
where $X_{i0}$ is an initial point belonging to block $i$. If it has,
then $X$ belongs to block $i$. If it does not, then another initial
point $X_{j0}$ belonging to block $j$, $j \ne i$, is tried until an initial
point $X_{k0}$ is found such that equation $X = X_{k0} + \lambda D^t$ has an
integer solution. This can be a very computationally expensive
procedure. In contrast, in the PVA, blocks of partitions are
explicitly expressed in terms of the vectors. To see which
block a given index point $\bar j \in Z^n$ belongs to, the computations
required are to compute $\Pi \bar j \pmod{\alpha}$ and $\Psi \bar j$. Finally, it must
be said that despite its disadvantages, the MDM provided the
initial inspiration for the ideas of this paper.

B.  Smith  Normal  Form  Approach  Versus 
Partitioning  Vector  Approach 

In the Smith normal form approach (abbreviated SNFA), a
partitioning matrix $U$ and the displacement vector $\bar s$ are used
to construct the $U$-partition. As indicated by Definition 5.2,
if an element $s_i$, $1 \le i \le m'$, of $\bar s$ is equal to one, then
$U_i \bar j \pmod{1} = 0$ for any arbitrary index point $\bar j$. This means
that all index points $\bar j \in J$ are assigned to the same block
by $U_i$. Therefore, when the value of $U\bar j \pmod{\bar s}$ is computed
in Step 2 of Procedure 5.1, the $i$th row $U_i$ of the partitioning
matrix $U$ can be ignored. Because $s_1 \mid s_2 \mid \cdots \mid s_{m'}$, it follows
that $s_1 \le \cdots \le s_{m'}$ and only the first $k \le m'$ elements of $\bar s$
can possibly be one. If the first $k$ elements of the displacement
vector $\bar s$ are equal to one, i.e., $s_1 = \cdots = s_k = 1$, $0 \le k \le m'$,
then the first $k$ rows of matrix $U$ can be removed and only
the last $n - k$ rows are needed in Step 2 of Procedure 5.1 to
construct the $U$-partition. When $m' = n$ and $k \ge n - 1$,
then only one row (vector) is needed to construct the
$U$-partition as in the PVA. The following theorem provides a
sufficient condition such that only the last row of $U$ is needed
to construct the $U$-partition.

Theorem 6.1: Given algorithm $(J, D)$, let $\bar s$ be the displacement
vector. If there exists a partitioning vector $\Pi$ such that
$\mathrm{disp}\,\Pi = |\det(TD_c)|$ where $D_c$ is the determining matrix of
$\Pi$ and $T = [\,\cdots\,]$ is such that $\mathrm{rank}(TD_c) = m'$, then
$s_1 = \cdots = s_{m'-1} = 1$.

Proof:  Provided  in  Appendix. 

Theorem 6.1 implies that if $n - m' + 1$ vectors are used to
construct the $\alpha\Psi$-partition, then only $n - m' + 1$ rows of matrix
$U$ are needed to construct the $U$-partition. Therefore, it is not
true that the SNFA always needs more vectors to construct
the $U$-partition than the PVA to construct the $\alpha\Psi$-partition if
$m' > 1$.

Compared with the SNFA, the PVA has the following advantages
and disadvantages. First, the SNFA always provides




the maximal pseudo-independent partitions for any
uniform-dependence algorithms. In contrast, the PVA provides the
maximal pseudo-independent partitions only when the
uniform-dependence algorithm satisfies the condition of Theorem 4.3a.
Second, most algorithms to find Smith normal forms have
exponential complexity. For the algorithm proposed in [6],
although of polynomial complexity, the constant and the
exponent are large; in particular, large memory space is
needed. When $m = n = m'$, the complexity of the PVA
is $O(n^5) + O(n|J|)$ and the complexity of the SNFA is
$O(n^{10}) + O(n^2|J|)$. So, generally speaking, the PVA is less
computationally expensive than the SNFA. Third, in MIMD
systems, one problem is to find an optimal time schedule such
that the total execution time plus the total overhead caused by
communication is minimized. In this case the PVA is preferred
because the partitioning vector $\Pi$ could also be used to specify
a linear schedule [20], [21].

VII. Future Work and Conclusions

Instead  of  finding  maximal  independent  partitions,  it  may 
also be desirable to find maximal dependent partitions such
that communication among blocks can be supported by the
target  machine.  For  nonpartitionable  algorithms  (an  algorithm 
is  nonpartitionable  if  its  pseudo-partitionability  is  equal  to 
one),  sometimes,  it  may  be  desirable  to  find  the  maximal 
partition  such  that  the  ratio  of  communication  between  a 
block  and  other  blocks  with  the  cardinality  of  that  block  is 
minimized.  The  results  presented  in  this  paper  can  be  used 
as  a  basis  for  future  work  dealing  with  these  problems.  The 
main  contributions  of  this  paper  are  two  computationally 
inexpensive  methods  to  identify  independent  partitions  of 
algorithms  with  uniform  dependencies.  The  resulting  partitions 
are  maximal.  These  methods  can  be  applied  in  practice  as 
one  of  the  many  analysis  procedures  used  by  optimizing 
compilers  to  detect  and  exploit  concurrency  in  serial  programs. 
They  may  be  particularly  useful  in  mapping  algorithms  into 
multiprocessor  machines  where  processors  are  organized  in 
clusters  with  limited  intercluster  communication  capabilities. 
In  these  systems,  different  clusters  can  process  distinct  blocks 
of  a  partition  without  intercluster  communication  overhead 
costs. Among others, such multiprocessors include Cedar [9]
and Cm* [5].


APPENDIX 

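Before the proofs are presented, some mathematical notations
are introduced. These notations are based on [23].
Consider a matrix $A \in R^{n \times n}$ where

$$A = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nn}
\end{bmatrix} = [\bar a_1, \bar a_2, \ldots, \bar a_n].$$

The cofactors of $A$ are denoted $A_{ij}$, $i, j = 1, \ldots, n$. The
adjugate (or adjoint) matrix of $A$ is denoted $A^*$. Some facts
[23] are also listed here which are used in some of the
following proofs.

Fact 1:

$$A^* = \begin{bmatrix}
A_{11} & A_{21} & \cdots & A_{n1} \\
A_{12} & A_{22} & \cdots & A_{n2} \\
\vdots & \vdots & & \vdots \\
A_{1n} & A_{2n} & \cdots & A_{nn}
\end{bmatrix}.$$

Fact 2: $A^{-1} = \dfrac{A^*}{\det A}$.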

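Lemma 8.1: Let $A = [\bar a_1, \ldots, \bar a_m] \in R^{n \times m}$, $\mathrm{rank}(A) = m'$,
$\bar b = [b_1, \ldots, b_n]^T \in \mathrm{sp}\{\bar a_1, \ldots, \bar a_m\}$ and $T \in R^{m' \times n}$ be
such that $\mathrm{rank}(TA) = m'$. Then $\bar x$ is a solution of equation
$TA\bar x = T\bar b$ if and only if it is a solution of equation $A\bar x = \bar b$.

Proof: $\Rightarrow$: Let

$$M = [A, \bar b] = \begin{bmatrix} M_1 \\ \vdots \\ M_n \end{bmatrix}.$$

Since $\bar b \in \mathrm{sp}\{\bar a_1, \ldots, \bar a_m\}$, $\mathrm{rank}(M) = m'$ and $\mathrm{rank}(TM) = m'$. Let

$$TM = [TA, T\bar b] = K = \begin{bmatrix} K_1 \\ \vdots \\ K_{m'} \end{bmatrix}.$$

Since $K_1, \ldots, K_{m'}$ are linear combinations of $M_1, \ldots, M_n$ and $\mathrm{rank}(K) =
\mathrm{rank}(M) = m'$, $\mathrm{sp}\{M_1, \ldots, M_n\} = \mathrm{sp}\{K_1, \ldots, K_{m'}\}$, i.e.,
$M_i$, $i = 1, \ldots, n$, can be expressed as linear combinations of
$K_1, \ldots, K_{m'}$. Let

$$M = \begin{bmatrix}
\gamma_{11} & \gamma_{12} & \cdots & \gamma_{1m'} \\
\gamma_{21} & \gamma_{22} & \cdots & \gamma_{2m'} \\
\vdots & \vdots & & \vdots \\
\gamma_{n1} & \gamma_{n2} & \cdots & \gamma_{nm'}
\end{bmatrix} K = \Gamma K$$

or $[A\ \bar b] = \Gamma K = \Gamma[TA\ \ T\bar b]$. Then $A = \Gamma TA$ and $\bar b = \Gamma T\bar b$.
Let $\bar x$ be a solution of equation $TA\bar x = T\bar b$, then $\Gamma TA\bar x =
\Gamma T\bar b$, i.e., $A\bar x = \bar b$. Therefore, $\bar x$ is also a solution of $A\bar x = \bar b$.

$\Leftarrow$: Let $\bar x$ be a solution for equation $A\bar x = \bar b$. Then
$TA\bar x = T\bar b$ which implies $\bar x$ is also a solution of equation
$TA\bar x = T\bar b$. □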


The  proofs  of  Lemma  5.1  and  Theorem  5.2  are  presented 
first  because  the  proof  of  Lemma  8.2  becomes  much  simpler 
if  the  results  of  Theorem  5.2  are  used. 

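Proof of Lemma 5.1:

$\Rightarrow$: Suppose $\bar j_1$ and $\bar j_2$ are pseudo-connected, then by
Definition 2.4, there exists an integer vector $\bar\lambda \in Z^m$ such
that $D\bar\lambda = \bar j_1 - \bar j_2$. Let $V \in Z^{m \times m}$ be such that $UDV = S$ is
the SNF of matrix $D$. Then $SV^{-1}\bar\lambda = U(\bar j_1 - \bar j_2)$. The fact that
$V$ is unimodular implies that $V^{-1}$ is also an integer matrix and
therefore, $V^{-1}\bar\lambda$ is an integer vector. So, $U(\bar j_1 - \bar j_2) \pmod{\bar s} =
SV^{-1}\bar\lambda \pmod{\bar s} = \bar 0$, i.e., $U\bar j_1 \pmod{\bar s} = U\bar j_2 \pmod{\bar s}$.

$\Leftarrow$: Suppose $U\bar j_1 \pmod{\bar s} = U\bar j_2 \pmod{\bar s}$, i.e., $U(\bar j_1 -
\bar j_2) = [y_1 s_1, y_2 s_2, \ldots, y_{m'} s_{m'}, 0, \ldots, 0]^T$ where $y_i$, $i =
1, \ldots, m'$, are integers. Let $\bar y = [y_1, y_2, \ldots, y_{m'}, 0, \ldots, 0]^T \in
Z^m$, then $U(\bar j_1 - \bar j_2) = S\bar y$. This implies that $\bar j_1 - \bar j_2 =
U^{-1}UDV\bar y$, or $\bar j_1 - \bar j_2 = DV\bar y$. The fact that $V$ is integral
implies that $V\bar y$ is integral. So, there exists an integral vector
$\bar\lambda = V\bar y$ such that $\bar j_1 - \bar j_2 = D\bar\lambda$ which means $\bar j_1$ and $\bar j_2$ are
pseudo-connected. □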
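Lemma 5.1 can be spot-checked numerically on the algorithm of Example 5.1. The sketch below compares the definition of pseudo-connectedness (an integer solution of $D\bar\lambda = \bar j_1 - \bar j_2$, tested here with Cramer's rule because this $D$ is nonsingular) against the test $U(\bar j_1 - \bar j_2) \pmod{\bar s} = \bar 0$. It is an illustration, not a proof, and the function names are hypothetical.

```python
from itertools import product

D = [[1, 2, 0], [-2, 4, 4], [4, -1, 2]]      # dependence matrix of Example 5.1
U = [[1, 0, 0], [2, -1, -1], [-14, 9, 8]]    # partitioning matrix of Example 5.1
s = (1, 1, 52)                               # displacement vector of Example 5.2

def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion along the first row."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def connected_by_definition(v):
    """True if D*lam = v has an integer solution lam (Cramer's rule, D nonsingular)."""
    d = det3(D)
    for col in range(3):
        # Replace column `col` of D by v and test divisibility of the determinant.
        Dc = [row[:col] + [v[r]] + row[col + 1:] for r, row in enumerate(D)]
        if det3(Dc) % d != 0:
            return False
    return True

def connected_by_lemma(v):
    """True if U*v is the zero vector modulo the displacement vector s."""
    Uv = [sum(u * x for u, x in zip(row, v)) for row in U]
    return all(c % m == 0 for c, m in zip(Uv, s))

# All possible differences j1 - j2 for index points in {1,...,16}^3.
diffs = product(range(-15, 16), repeat=3)
agree = all(connected_by_definition(v) == connected_by_lemma(v) for v in diffs)
print(agree)
```

The second test needs only one matrix-vector product per difference vector, which is why the SNF approach can classify index points without solving a linear Diophantine system each time.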

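Proof of Theorem 5.2: Let $\bar s$ be the displacement vector.

1) First, it is shown that $P_U$ is a pseudo-independent
partition. For any two arbitrary points $\bar j_1 \in J_{\bar y_i} \in P_U$ and
$\bar j_2 \in J_{\bar y_j} \in P_U$, $\bar y_i \ne \bar y_j$, by Definition 5.2, $U\bar j_1 \pmod{\bar s} = \bar y_i$
and $U\bar j_2 \pmod{\bar s} = \bar y_j$. Because $\bar y_i \ne \bar y_j$, $U\bar j_1 \pmod{\bar s} \ne$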


$U\bar j_2 \pmod{\bar s}$. By Lemma 5.1, $\bar j_1$ and $\bar j_2$ are not
pseudo-connected which implies that $P_U$ is pseudo-independent. Next,
it is shown that $P_U$ is the maximal pseudo-independent partition.
For any two arbitrary points $\bar j_1, \bar j_2 \in J_{\bar y} \in P_U$, by
definition of the $U$-partition (Definition 5.2), $U\bar j_1 \pmod{\bar s} =
U\bar j_2 \pmod{\bar s} = \bar y$. By Lemma 5.1, $\bar j_1$ and $\bar j_2$ are
pseudo-connected. By Definition 2.5, $P_U$ is the maximal
pseudo-independent partition.

2) Let $P_U = \{J_{\bar y_1}, \ldots, J_{\bar y_p}\}$. Consider a block $J_{\bar y} \in P_U$
where $\bar y = [y_1, \ldots, y_n]^T$. Clearly, $0 \le y_i < s_i$, $i = 1, \ldots, m'$
and $\delta_{kl} \le y_k \le \delta_{ku}$, $k = m' + 1, \ldots, n$. So there are at most
$s_1 \times \cdots \times s_{m'} \times \gamma_{m'+1} \times \cdots \times \gamma_n$ distinct block indexes,
i.e., $|P_{\max}| = |P_U| \le \prod_{i=1}^{m'} s_i \prod_{i=m'+1}^{n} \gamma_i$. □

Proof  of  Theorem  3.1: 

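1) Suppose $\bar j_1$ and $\bar j_2$ are pseudo-connected, then there exists
a vector $\bar\lambda = [\lambda_1, \ldots, \lambda_m]^T \in Z^m$ such that $\bar j_1 + D\bar\lambda = \bar j_2$.
Therefore,

$$\Pi \bar j_1 + \Pi D\bar\lambda = \Pi \bar j_2$$

or

$$\Pi \bar j_1 + \sum_{i=1}^{m} \lambda_i \Pi \bar d_i = \Pi \bar j_2.$$

By Definition 3.1, $\gcd(\Pi \bar d_1, \ldots, \Pi \bar d_m) = \alpha$. This implies that
$\Pi \bar d_i = \alpha\beta_i$ where $\beta_i \in Z$, $i = 1, \ldots, m$. So,

$$\Pi \bar j_2 - \Pi \bar j_1 = \alpha \sum_{i=1}^{m} \lambda_i \beta_i,$$

where $\sum_{i=1}^{m} \lambda_i \beta_i$ is an integer because $\lambda_i$ and $\beta_i$, $i =
1, \ldots, m$, are integers. Then,

$$\Pi(\bar j_2 - \bar j_1) \pmod{\alpha} = 0,$$

i.e., $\Pi \bar j_1 \pmod{\alpha} = \Pi \bar j_2 \pmod{\alpha}$. This contradicts the
assumption. So $\bar j_1$ and $\bar j_2$ are not pseudo-connected.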

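Consider the $\alpha$-partition $P_\alpha$. Since $\Pi \bar j \pmod{\alpha} = i$, $\bar j \in
J_i \in P_\alpha$, $i = 0, \ldots, \alpha - 1$, for any two arbitrary index
points $\bar j \in J_i$, $\bar j' \in J_l$, $i \ne l$, $\Pi \bar j \pmod{\alpha} \ne \Pi \bar j' \pmod{\alpha}$
and they are not pseudo-connected. By Definition 2.5, $P_\alpha$ is a
pseudo-independent partition.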

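2) Suppose that $\bar j_1, \bar j_2$ are pseudo-connected. Then there
exists a vector $\bar\lambda \in Z^m$ such that $D\bar\lambda = (\bar j_1 - \bar j_2)$ and,
therefore, $\Psi D\bar\lambda = \Psi(\bar j_1 - \bar j_2)$. By Definition 3.3, $\Psi D =
0$ which implies that $\Psi(\bar j_1 - \bar j_2) = 0$, i.e., $\Psi \bar j_1 = \Psi \bar j_2$.
This contradicts the assumption. So, $\bar j_1, \bar j_2$ are not
pseudo-connected. For the $\Psi$-partition $P_\Psi$, let $\bar j_1 \in J_{\bar y_i}$ and $\bar j_2 \in
J_{\bar y_l}$, $\bar y_i \ne \bar y_l$, $J_{\bar y_i}, J_{\bar y_l} \in P_\Psi$. The fact that $\bar y_i \ne \bar y_l$ implies that
$\Psi \bar j_1 \ne \Psi \bar j_2$. So, $\bar j_1, \bar j_2$ are not pseudo-connected. By Definition
2.5, $P_\Psi$ is pseudo-independent.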

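3) Similarly, let $\bar j_1 \in J_{\bar y_i} \in P_{\alpha\Psi}$ and $\bar j_2 \in J_{\bar y_l} \in
P_{\alpha\Psi}$, $\bar y_i \ne \bar y_l$, where $\bar y_i = [y_{0i}, y_{1i}, \ldots, y_{(n-m')i}]^T$ and $\bar y_l =
[y_{0l}, y_{1l}, \ldots, y_{(n-m')l}]^T$. Since $\bar y_i \ne \bar y_l$, there exists at least
one dimension $t \in \{0, 1, \ldots, n - m'\}$ such that $y_{ti} \ne y_{tl}$.
If $t = 0$, then $\Pi \bar j_1 \pmod{\alpha} \ne \Pi \bar j_2 \pmod{\alpha}$ and by 1) of
Theorem 3.1, $\bar j_1$ and $\bar j_2$ are not pseudo-connected. If $1 \le t \le
n - m'$, then $\Psi_t \bar j_1 \ne \Psi_t \bar j_2$ and by 2) of Theorem 3.1, $\bar j_1$ and
$\bar j_2$ are not pseudo-connected. So, by Definition 2.5, $P_{\alpha\Psi}$ is
pseudo-independent. □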


Proof of Theorem 3.2: If it can be shown that there exists
at least one index point $\bar j \in J$ such that $\Pi \bar j \pmod{\alpha} = i$, then
$J_i \ne \emptyset$, where $J_i \in P_\alpha$, $i = 0, \ldots, \alpha - 1$. This implies that
the maximal pseudo-independent partition contains at least $\alpha$
blocks. So, $|P_{\max}| \ge \alpha$.

Let $\Pi = [\pi_1\ \pi_2\ \cdots\ \pi_n]$. Since $\Pi$ is a nonzero vector, it has
at least one nonzero component. Without loss of generality, let
$\pi_1 \ne 0$. Let $M = t\alpha + i$ (i.e., $M \pmod{\alpha} = i$), $M \in Z -
\{0\}$, $t \in Z$, and $M \cdot \pi_1 > 0$. Since $\gcd(\pi_1, \ldots, \pi_n) = 1$, by
[12] there exists at least one integer solution of the following
equation

$$\pi_1 \lambda_1 + \cdots + \pi_n \lambda_n = M. \tag{8.1}$$

Let $\bar z = [z_1, \ldots, z_n]^T$ be such an integer solution of (8.1). If
$z_1, \ldots, z_n \ge 0$, then $\bar z \in J$ and it has been proven that for
any arbitrary integer $i$, $0 \le i < \alpha$, there exists at least one
index point $\bar z \in J$ such that $\Pi \bar z \pmod{\alpha} = M \pmod{\alpha} = i$.
Now suppose not all $z_1, \ldots, z_n \ge 0$. It is clear that all the
solutions of (8.1) take the form


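$$\bar\lambda = \bar z +
\begin{bmatrix} -\pi_2 \\ \pi_1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} t_2 +
\begin{bmatrix} -\pi_3 \\ 0 \\ \pi_1 \\ \vdots \\ 0 \end{bmatrix} t_3 + \cdots +
\begin{bmatrix} -\pi_n \\ 0 \\ 0 \\ \vdots \\ \pi_1 \end{bmatrix} t_n \tag{8.2}$$

where $t_2, t_3, \ldots, t_n$ are constants. This can be verified as
follows.

$$\Pi\bar\lambda = \Pi\bar z +
\Pi \begin{bmatrix} -\pi_2 \\ \pi_1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} t_2 +
\Pi \begin{bmatrix} -\pi_3 \\ 0 \\ \pi_1 \\ \vdots \\ 0 \end{bmatrix} t_3 + \cdots +
\Pi \begin{bmatrix} -\pi_n \\ 0 \\ 0 \\ \vdots \\ \pi_1 \end{bmatrix} t_n$$

$$= M + (-\pi_1\pi_2 + \pi_2\pi_1)t_2 + (-\pi_1\pi_3 + \pi_3\pi_1)t_3 + \cdots = M.$$

Therefore, $\bar\lambda$ is a solution of (8.1). Next, it is shown how a
nonnegative integer solution of (8.1) is constructed from (8.2).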


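Let

$$\bar\lambda = \begin{bmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \vdots \\ \lambda_n \end{bmatrix} =
\begin{bmatrix} z_1 - \pi_2 t_2 - \pi_3 t_3 - \cdots - \pi_n t_n \\ z_2 + \pi_1 t_2 \\ z_3 + \pi_1 t_3 \\ \vdots \\ z_n + \pi_1 t_n \end{bmatrix}.$$

If $t_i \ge -\dfrac{z_i}{\pi_1}$ then $\lambda_i \ge 0$, $i = 2, \ldots, n$. Let

$$t_i = -\frac{z_i}{\pi_1} + \Delta_i \quad \text{where } 0 \le \Delta_i < 1,\ i = 2, \ldots, n. \tag{8.3}$$

Now it is shown that $\lambda_1$ is also greater than or equal to zero
if $t_i$, $i = 2, \ldots, n$, are defined by (8.3).

$$\lambda_1 = z_1 - \sum_{i=2}^{n} \pi_i t_i = z_1 - \sum_{i=2}^{n} \pi_i \left(-\frac{z_i}{\pi_1} + \Delta_i\right)
= z_1 - \sum_{i=2}^{n} \pi_i \Delta_i + \frac{1}{\pi_1}(M - z_1\pi_1)
= \frac{M}{\pi_1} - \sum_{i=2}^{n} \pi_i \Delta_i.$$

Notice that $\sum_{i=2}^{n} |\pi_i| > \sum_{i=2}^{n} |\pi_i \Delta_i| \ge \sum_{i=2}^{n} \pi_i \Delta_i$. If $M$ is
selected such that $M = t\alpha + i$ and $\dfrac{M}{\pi_1} \ge \sum_{i=2}^{n} |\pi_i|$, then
$\lambda_1 = \dfrac{M}{\pi_1} - \sum_{i=2}^{n} \pi_i \Delta_i \ge \sum_{i=2}^{n} |\pi_i| - \sum_{i=2}^{n} \pi_i \Delta_i \ge 0$. So, there exists
an index point $\bar\lambda \in J$ such that $\Pi\bar\lambda \pmod{\alpha} = i$. □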

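Proof of Theorem 4.1:

$\Rightarrow$: Let $\Pi$ be an equal partitioning vector of $(J, D)$. Then
by Definition 3.1, there exists a set of $m'$ linearly independent
dependence vectors $\bar d_{t_1}, \ldots, \bar d_{t_{m'}}$ such that

$$\Pi \bar d_{t_1} = \cdots = \Pi \bar d_{t_{m'}} = \mathrm{disp}\,\Pi > 0$$
$$\Pi \bar d_j = \alpha_j\, \mathrm{disp}\,\Pi, \quad \alpha_j \in Z,\ j = 1, \ldots, m. \tag{8.4}$$

Since $\mathrm{rank}(D) = m'$ and $\bar d_{t_1}, \bar d_{t_2}, \ldots, \bar d_{t_{m'}}$ are linearly
independent, $\bar d_j$, $1 \le j \le m$, can be expressed as a linear
combination of $\bar d_{t_1}, \ldots, \bar d_{t_{m'}}$, i.e.,

$$\bar d_j = [\bar d_{t_1}, \ldots, \bar d_{t_{m'}}]\, \bar a_j, \quad 1 \le j \le m \tag{8.5}$$

where $\bar a_j = [a_{1j}, \ldots, a_{m'j}]^T$. Then

$$\Pi \bar d_j = \Pi[\bar d_{t_1}, \ldots, \bar d_{t_{m'}}]\, \bar a_j = \mathrm{disp}\,\Pi\; \bar 1\, \bar a_j = \mathrm{disp}\,\Pi \sum_{i=1}^{m'} a_{ij}.$$

By (8.4), $\Pi \bar d_j = \alpha_j\, \mathrm{disp}\,\Pi$, where $\alpha_j \in Z$, $1 \le j \le m$. So
$\sum_{i=1}^{m'} a_{ij} = \alpha_j$, $1 \le j \le m$, are integers.

$\Leftarrow$: Consider the following two systems of equations:

$$\begin{cases} \Pi(\bar d_{t_1} - \bar d_{t_2}) = 0 \\ \Pi(\bar d_{t_1} - \bar d_{t_3}) = 0 \\ \quad\vdots \\ \Pi(\bar d_{t_1} - \bar d_{t_{m'}}) = 0 \end{cases} \tag{8.6}$$

$$\begin{cases} \Pi \bar d_{t_1} = 0 \\ \Pi \bar d_{t_2} = 0 \\ \quad\vdots \\ \Pi \bar d_{t_{m'}} = 0. \end{cases} \tag{8.7}$$

Let $N_1$ be the solution space of (8.6) and $N_2$ be the solution
space of (8.7). If $\Pi$ is a solution of (8.7), then it is a solution of
(8.6). Thus $N_2 \subseteq N_1$. The dimension of $N_1$ is $n - m' + 1$ and
the dimension of $N_2$ is $n - m'$. This implies that $N_2 \subset N_1$.
So, there exists at least one solution $\Pi'$ of (8.6) such that
$\Pi' \bar d_{t_j} \ne 0$, $1 \le j \le m'$. Let $\Pi \in Z^{1 \times n}$ be such a solution
of (8.6) that $\Pi \bar d_{t_j} = \alpha > 0$, $j = 1, \ldots, m'$, and the greatest
common divisor of the $n$ components of $\Pi$ is equal to one.
Then

$$\Pi \bar d_j = \Pi[\bar d_{t_1}, \ldots, \bar d_{t_{m'}}]\, \bar a_j = \alpha\, \bar 1\, \bar a_j = \alpha \sum_{i=1}^{m'} a_{ij}.$$

Since $\sum_{i=1}^{m'} a_{ij}$, $j = 1, \ldots, m$, are integers, the following
holds.

$$\Pi \bar d_j \pmod{\alpha} = 0,\ j = 1, \ldots, m \quad \text{and} \quad \mathrm{disp}\,\Pi = \alpha > 0.$$

By Definition 3.1, $\Pi$ is an equal partitioning vector of $(J, D)$. □

Proof of Corollary 4.1:

1) Let us first prove that if algorithm $(J, D)$ satisfies the first
condition, then it has an equal partitioning vector. Consider the
following two systems of equations:

$$\begin{cases} \Pi(\bar d_1 - \bar d_2) = 0 \\ \Pi(\bar d_1 - \bar d_3) = 0 \\ \quad\vdots \\ \Pi(\bar d_1 - \bar d_m) = 0 \end{cases} \tag{8.8}$$

$$\begin{cases} \Pi \bar d_1 = 0 \\ \Pi \bar d_2 = 0 \\ \quad\vdots \\ \Pi \bar d_m = 0. \end{cases}$$

Using reasoning similar to that used in the second part of
the proof of Theorem 4.1, it can be shown that there exists
at least one solution $\Pi'$ of (8.8) such that $\Pi' \bar d_j \ne 0$, $1 \le
j \le m$. Let $\Pi \in Z^{1 \times n}$ be such a solution of (8.8) that
$\Pi \bar d_j = \mathrm{disp}\,\Pi > 0$, $j = 1, \ldots, m$, and the greatest common
divisor of the $n$ components of $\Pi$ is equal to one. Then
$\Pi \bar d_j = \mathrm{disp}\,\Pi > 0$, $j = 1, \ldots, m$. Therefore, $\Pi$ is an equal
partitioning vector.

2) By the assumption of condition 2, there exists a set of
$m'$ linearly independent dependence vectors $\bar d_{t_1}, \ldots, \bar d_{t_{m'}}$ such
that all dependence vectors can be expressed as an integer linear
combination of $\bar d_{t_1}, \ldots, \bar d_{t_{m'}}$, i.e., $\bar d_j = \sum_{i=1}^{m'} a_{ij} \bar d_{t_i}$, $j =
1, \ldots, m$, where $a_{ij}$, $i = 1, \ldots, m'$, and $j = 1, \ldots, m$,
are integer constants. Clearly, $\sum_{i=1}^{m'} a_{ij}$, $j = 1, \ldots, m$, are
integers. According to Theorem 4.1, $(J, D)$ has an equal
partitioning vector. □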



Lemma 8.2: Let $m' = n$, $\Pi$ be a partitioning vector of algorithm
$(J, D)$ determined by $\bar d_{t_1}, \ldots, \bar d_{t_n}$ (the columns of the
determining matrix $D_c$), and $\bar j_1, \bar j_2 \in J$. If
$|\det D_c| = \operatorname{disp}\Pi$ and $\Pi(\bar j_1 - \bar j_2)\ (\mathrm{mod}\ \operatorname{disp}\Pi) = 0$,
then $\bar j_1$ and $\bar j_2$ are pseudo-connected.

Proof: Consider algorithm $(J, D_c)$. Its algorithm coefficient
$\alpha'$ is equal to $|\det D_c|$ because
$\alpha' = \gcd(\Pi \bar d_{t_1}, \ldots, \Pi \bar d_{t_n}) = \gcd(\operatorname{disp}\Pi, \ldots, \operatorname{disp}\Pi) = \operatorname{disp}\Pi = |\det D_c|$.
Let $\bar s' = [s'_1, \ldots, s'_n]^T$ be the displacement vector of algorithm
$(J, D_c)$. By Theorem 5.1, $\prod_{i=1}^{n} s'_i = |\det D_c|$. By Theorem
5.2, there are at most $|\det D_c|$ blocks in the maximal
pseudo-independent partition of algorithm $(J, D_c)$, and by Corollary 3.2
$|P_{\alpha'}| = \alpha' = |\det D_c|$. So, the $\alpha'$-partition is the maximal
pseudo-independent partition. Because
$\Pi(\bar j_1 - \bar j_2)\ (\mathrm{mod}\ \operatorname{disp}\Pi) = 0$, $\bar j_1$ and $\bar j_2$ are in the
same block of the $\alpha'$-partition. Therefore, $\bar j_1$ and $\bar j_2$ are
pseudo-connected. □

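Proof of Theorem 4.3:

1) By Theorem 3.1, $\bar j_1, \bar j_2$ are pseudo-connected only if
$\Pi \bar j_1\ (\mathrm{mod}\ \alpha) = \Pi \bar j_2\ (\mathrm{mod}\ \alpha)$. Let us prove that
$\bar j_1, \bar j_2$ are pseudo-connected if
$\Pi \bar j_1\ (\mathrm{mod}\ \alpha) = \Pi \bar j_2\ (\mathrm{mod}\ \alpha)$. Because
$\Pi(\bar j_1 - \bar j_2)\ (\mathrm{mod}\ \alpha) = 0$, it follows that
$\Pi(\bar j_1 - \bar j_2) = \gamma_1 \operatorname{disp}\Pi + \gamma_2 \alpha$,
$0 \le \gamma_2 \alpha < \operatorname{disp}\Pi$ and $\gamma_1, \gamma_2 \in Z$. If $\gamma_2 = 0$, by Lemma
3.2, $\bar j_1$ and $\bar j_2$ are pseudo-connected. Suppose $\gamma_2 \ne 0$. Because
$\gcd(\Pi \bar d_1, \ldots, \Pi \bar d_m) = \alpha$, by [12] there exists at least one
integer solution of the following equation:

$$\lambda_1 \Pi \bar d_1 + \cdots + \lambda_m \Pi \bar d_m = \alpha. \qquad (8.9)$$

Let $\bar\lambda = [\lambda_1, \ldots, \lambda_m]^T$ be an integer solution of (8.9). Then
$\gamma_2 \bar\lambda$ is an integer solution of the following equation:

$$\gamma_2 \lambda_1 \Pi \bar d_1 + \cdots + \gamma_2 \lambda_m \Pi \bar d_m = \gamma_2 \alpha. \qquad (8.9a)$$

Let $\bar j = \bar j_2 + \gamma_2 D \bar\lambda$. Clearly, $\bar j$ and $\bar j_2$ are pseudo-connected
and by (8.9a)

$$\Pi(\bar j - \bar j_2) = \gamma_2 \lambda_1 \Pi \bar d_1 + \cdots + \gamma_2 \lambda_m \Pi \bar d_m = \gamma_2 \alpha.$$

Therefore,
$\Pi(\bar j_1 - \bar j) = \Pi \bar j_1 - \Pi \bar j + \Pi \bar j_2 - \Pi \bar j_2 = \Pi(\bar j_1 - \bar j_2) - \Pi(\bar j - \bar j_2) = \gamma_1 \operatorname{disp}\Pi + \gamma_2 \alpha - \gamma_2 \alpha$.
So, $\Pi(\bar j_1 - \bar j)\ (\mathrm{mod}\ \operatorname{disp}\Pi) = 0$. By Lemma 8.2, $\bar j_1$ and $\bar j$ are
pseudo-connected. Since $\bar j_1, \bar j$ and $\bar j, \bar j_2$ are pseudo-connected,
respectively, $\bar j_1$ and $\bar j_2$ are also pseudo-connected. Therefore, if
$\Pi(\bar j_1 - \bar j_2)\ (\mathrm{mod}\ \alpha) = 0$, then $\bar j_1$ and $\bar j_2$ are
pseudo-connected.

2) Consider the $\alpha$-partition $P_\alpha$. Since
$\Pi \bar j_1\ (\mathrm{mod}\ \alpha) = \Pi \bar j_2\ (\mathrm{mod}\ \alpha)$, where
$\bar j_1, \bar j_2 \in J_i \in P_\alpha$ are arbitrary index points, $0 \le i < \alpha$, by
the results in 1) of Theorem 4.3, $\bar j_1, \bar j_2$ are pseudo-connected. By
Definition 2.5, $P_\alpha$ is the maximal pseudo-independent partition, i.e.,
$P_\alpha = P_{max}$ and $|P_{max}| = |P_\alpha| = \alpha$. □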
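
The existence of an integer solution of (8.9) can be illustrated numerically. The following is a minimal sketch, not part of the original proof: it uses the extended Euclidean algorithm to produce integers $\lambda_1, \ldots, \lambda_m$ whose combination of given integer values equals their greatest common divisor. The function names `ext_gcd` and `bezout` and the sample values standing in for $\Pi \bar d_1, \ldots, \Pi \bar d_m$ are assumptions made for illustration only.

```python
def ext_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g, where g = gcd(a, b) >= 0."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    if old_r < 0:  # normalize the sign so the gcd is nonnegative
        old_r, old_x, old_y = -old_r, -old_x, -old_y
    return old_r, old_x, old_y


def bezout(values):
    """Return (g, lambdas) with sum(l * v) == g == gcd(values)."""
    g, x, _ = ext_gcd(values[0], 0)
    lambdas = [x]
    for v in values[1:]:
        # Fold in one more value: g_new = s * g_old + t * v.
        g, s, t = ext_gcd(g, v)
        lambdas = [s * l for l in lambdas] + [t]
    return g, lambdas


# Hypothetical values standing in for Pi*d_1, ..., Pi*d_m.
pi_d = [6, 10, 15]
alpha, lam = bezout(pi_d)
assert sum(l * v for l, v in zip(lam, pi_d)) == alpha

# Scaling the solution by gamma_2 gives a solution of the scaled equation (8.9a).
gamma_2 = 3
assert sum(gamma_2 * l * v for l, v in zip(lam, pi_d)) == gamma_2 * alpha
```

The scaled vector corresponds to $\gamma_2 \bar\lambda$ in the proof; the sketch only checks the arithmetic identity and does not construct index points or a partitioning vector.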

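Lemma 8.3: Let $J_{\bar y} \in P_\Psi$, then
$J_{\bar y} = \{\bar j : \bar j \in J,\ \bar j = \bar p + D\bar x,\ \bar x \in R^m,\ \Psi \bar p = \bar y\}$.

Proof: Let $S$ denote
$\{\bar j : \bar j \in J,\ \bar j = \bar p + D\bar x,\ \bar x \in R^m,\ \Psi \bar p = \bar y\}$.
If $\bar j \in J_{\bar y}$, then $\Psi \bar j = \bar y$ and $\Psi(\bar j - \bar p) = \bar 0$.
By the Fundamental Theorem of Linear Algebra [23, pp. 15,
87], $\bar j - \bar p$ belongs to $sp\{\bar d_1, \ldots, \bar d_m\}$ because $\Psi D = 0$. This
implies that there exists an $\bar x \in R^m$ such that $\bar j - \bar p = D\bar x$,
i.e., $\bar j = \bar p + D\bar x$. So, $\bar j \in S$ and $J_{\bar y} \subseteq S$. Now, let $\bar j \in S$, then
$\bar j = \bar p + D\bar x$ and $\Psi \bar j = \Psi \bar p + \Psi D \bar x = \Psi \bar p = \bar y$. So $\bar j \in J_{\bar y}$ and
$J_{\bar y} = S$. □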

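Lemma 8.4: Let $\bar d_{t_1}, \ldots, \bar d_{t_{m'}}$ be linearly independent,
$D_c = [\bar d_{t_1}, \ldots, \bar d_{t_{m'}}]$, $T \in Z^{m' \times n}$ be such that
$\operatorname{rank}(TD_c) = m'$, $J_{\bar y} \in P_\Psi$ and $L_{\bar y} = \{T\bar j : \bar j \in J_{\bar y}\}$.
Then, 1) the mapping $T : J_{\bar y} \to L_{\bar y}$, $T(\bar j) = T\bar j$, $\bar j \in J_{\bar y}$, is
bijective and 2) $T\bar j_1, T\bar j_2 \in L_{\bar y}$ are pseudo-connected in algorithm
$(L_{\bar y}, TD)$ if and only if $\bar j_1, \bar j_2 \in J_{\bar y}$ are pseudo-connected in
algorithm $(J_{\bar y}, D)$, i.e., $\bar j_1 - \bar j_2 = D\bar\lambda$ if and only if
$T(\bar j_1 - \bar j_2) = TD\bar\lambda$, $\bar\lambda \in Z^m$.

Proof:

1) Consider the mapping $T : J_{\bar y} \to L_{\bar y}$, $T(\bar j) = T\bar j$,
$\bar j \in J_{\bar y}$. Since $L_{\bar y} = \{T\bar j : \bar j \in J_{\bar y}\}$, $T$ is surjective. By Lemma
8.3, $J_{\bar y} = \{\bar j : \bar j \in J,\ \bar j = \bar p + D\bar x,\ \bar x \in R^m,\ \Psi \bar p = \bar y\}$. Since
$\bar d_{t_1}, \ldots, \bar d_{t_{m'}}$ are linearly independent and $\operatorname{rank}(D) = m'$, each
dependence vector can be written as a linear combination of
$\bar d_{t_1}, \ldots, \bar d_{t_{m'}}$, i.e., $D = D_c A$ where $A \in R^{m' \times m}$. So, $J_{\bar y}$ can
be rewritten as
$J_{\bar y} = \{\bar j : \bar j = \bar p + D_c \bar z,\ \bar z = A\bar x,\ \bar j \in J,\ \bar x \in R^m,\ \Psi \bar p = \bar y\}$.
Let $\bar j_1, \bar j_2 \in J_{\bar y}$ and $\bar j_1 = \bar p + D_c \bar z_1$, $\bar j_2 = \bar p + D_c \bar z_2$;
then $T(\bar j_1 - \bar j_2) = TD_c(\bar z_1 - \bar z_2) = \bar 0$ if and only
if $\bar z_1 = \bar z_2$, or equivalently, $\bar j_1 = \bar j_2$, since $\operatorname{rank}(TD_c) = m'$.
In summary, it has been shown that $T\bar j_1 = T\bar j_2$ if and only if
$\bar j_1 = \bar j_2$. So $T$ is injective, which implies $T$ is bijective.

2) ($\Rightarrow$): If $\bar j_1, \bar j_2 \in J_{\bar y}$ are pseudo-connected, then there is a
vector $\bar\lambda \in Z^m$ such that $\bar j_1 = \bar j_2 + D\bar\lambda$. So, $T\bar j_1 = T\bar j_2 + TD\bar\lambda$
and $T\bar j_1, T\bar j_2$ are pseudo-connected.

($\Leftarrow$): If $T\bar j_1, T\bar j_2$ are pseudo-connected, then there exists
a vector $\bar\lambda \in Z^m$ such that $T(\bar j_1 - \bar j_2) = TD\bar\lambda$. Index points
$\bar j_1, \bar j_2 \in J_{\bar y}$ implies that $\bar j_1 - \bar j_2 \in sp\{\bar d_1, \ldots, \bar d_m\}$. By Lemma
8.1, $\bar\lambda$ is also a solution of equation $D\bar\lambda = (\bar j_1 - \bar j_2)$, which
implies that $\bar j_1$ and $\bar j_2$ are pseudo-connected. □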

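Proof of Theorem 4.3a:

1) ($\Rightarrow$): see Theorem 3.1.

($\Leftarrow$): Since $\Psi \bar j_1 = \Psi \bar j_2$, $\bar j_1$ and $\bar j_2$ belong to the same block
of the $\Psi$-partition, i.e., $\bar j_1, \bar j_2 \in J_{\bar y} \in P_\Psi$, where $\bar y = \Psi \bar j_1$.
Let $L_{\bar y} = \{T\bar j : \bar j \in J_{\bar y}\}$ and $\Delta = TD$. Consider the
algorithm $(L_{\bar y}, \Delta)$, let $\Gamma$ be the partitioning vector determined
by $T\bar d_{t_1}, \ldots, T\bar d_{t_{m'}}$ and $\alpha' = \gcd(\Gamma T \bar d_1, \ldots, \Gamma T \bar d_m)$ be the
algorithm coefficient for algorithm $(L_{\bar y}, \Delta)$. By Theorem 4.2,
$\Gamma = \beta' \bar 1 (TD_c)^{-1}$ and $\operatorname{disp}\Gamma = \beta'$, where $\beta' \in N^+$ is
such that the greatest common divisor of the $m'$ components
of $\Gamma$ is equal to one and $\Gamma \in Z^{1 \times m'}$. Now, consider the
row vector $\Pi = (1/\beta'')\Gamma T = \beta \bar 1 (TD_c)^{-1} T$, where
$\beta = \beta'/\beta''$, $\beta \in N^+$ is such that the greatest common divisor of
the $n$ components of $\Pi$ is equal to unity and $\Pi \in Z^{1 \times n}$. By
Theorem 4.2, $\Pi$ is a partitioning vector determined by vectors
$\bar d_{t_1}, \ldots, \bar d_{t_{m'}}$ for algorithm $(J, D)$. Therefore, the algorithm
coefficient for algorithm $(J, D)$ is

$$\alpha = \gcd(\Pi \bar d_1, \ldots, \Pi \bar d_m)
= \gcd((1/\beta'')\Gamma T \bar d_1, \ldots, (1/\beta'')\Gamma T \bar d_m) = \alpha'/\beta''.$$

By assumption $\Pi(\bar j_1 - \bar j_2)\ (\mathrm{mod}\ \alpha) = 0$, i.e.,
$\Pi(\bar j_1 - \bar j_2) = \lambda\alpha$ where $\lambda \in Z$. Because $\alpha = \alpha'/\beta''$,
$\Pi \bar j_1 = (1/\beta'')\Gamma T \bar j_1$ and $\Pi \bar j_2 = (1/\beta'')\Gamma T \bar j_2$,
$\Pi(\bar j_1 - \bar j_2) = (1/\beta'')\Gamma(T\bar j_1 - T\bar j_2) = \lambda\alpha'/\beta''$, or
$\Gamma(T\bar j_1 - T\bar j_2) = \lambda\alpha'$, which means
$\Gamma(T\bar j_1 - T\bar j_2)\ (\mathrm{mod}\ \alpha') = 0$. That is, if
$\Pi(\bar j_1 - \bar j_2)\ (\mathrm{mod}\ \alpha) = 0$, then
$\Gamma(T\bar j_1 - T\bar j_2)\ (\mathrm{mod}\ \alpha') = 0$. Notice that the dimension
of the index points of algorithm $(L_{\bar y}, \Delta)$ is $m'$ and the rank
of its dependence matrix $\Delta$ is $m'$. According to Theorem
4.3 1), for any two arbitrary index points $T\bar j_1, T\bar j_2 \in L_{\bar y}$,
if $\Gamma(T\bar j_1 - T\bar j_2)\ (\mathrm{mod}\ \alpha') = 0$, then $T\bar j_1, T\bar j_2$ are
pseudo-connected. According to Lemma 8.4, $\bar j_1, \bar j_2 \in J_{\bar y}$ are also
pseudo-connected in algorithm $(J_{\bar y}, D)$. In summary, it has
been proven that for any two arbitrary index points $\bar j_1, \bar j_2 \in J$,
if $\Pi(\bar j_1 - \bar j_2)\ (\mathrm{mod}\ \alpha) = 0$ and $\Psi \bar j_1 = \Psi \bar j_2$, then $\bar j_1, \bar j_2$ are
pseudo-connected.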


SHANG AND FORTES: INDEPENDENT PARTITIONING OF ALGORITHMS WITH UNIFORM DEPENDENCIES




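2) Consider any two arbitrary points $\bar j_1, \bar j_2$ belonging to the same
block of the $\alpha\Psi$-partition $P_{\alpha\Psi}$. Since
$\Pi(\bar j_1 - \bar j_2)\ (\mathrm{mod}\ \alpha) = 0$ and
$\Psi(\bar j_1 - \bar j_2) = \bar 0$, by the results in Theorem 4.3a 1), they
are pseudo-connected. By Definition 2.5, $P_{\alpha\Psi}$ is the maximal
pseudo-independent partition.

3) Let $P_\alpha^{\bar y}$ denote the $\alpha$-partition for algorithm $(L_{\bar y}, \Delta)$.
First, $|P_\alpha^{\bar y}| \le \alpha$ because there are at most $\alpha$ nonempty blocks in
$P_\alpha^{\bar y}$. Second, for $P_\Psi$, clearly, there are at most
$\prod_{i=1}^{n-m'} (\gamma_i + 1)$ nonempty blocks, where
$\gamma_i = \max\{\Psi_i(\bar j_1 - \bar j_2) : \bar j_1, \bar j_2 \in J\}$, $i = 1, \ldots, n - m'$.
So, $|P_\Psi| \le \prod_{i=1}^{n-m'} (\gamma_i + 1)$. The fact
that $|P_\alpha^{\bar y}| \le \alpha$ implies that
$|P_{max}| \le \max\{|P_\alpha^{\bar y}| : J_{\bar y} \in P_\Psi\}\,|P_\Psi| = \alpha|P_\Psi|$.
By Theorem 3.1 and Corollary 3.2, $P_\alpha$
is a pseudo-independent partition of $(J, D)$ and $|P_\alpha| = \alpha$. So

$$\alpha \le |P_{max}| \le \alpha|P_\Psi|. \qquad \Box$$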


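Proof of Theorem 6.1: Without loss of generality, let
$\det(TD_c) > 0$. Because $\operatorname{disp}\Pi = \det(TD_c)$, by Theorem
4.2, $\Pi = \operatorname{disp}\Pi \, \bar 1 (TD_c)^{-1} T = \det(TD_c) \, \bar 1 (TD_c)^{-1} T$.
By Fact 2, $\Pi = \bar 1 (TD_c)^{*} T$. Let $\Gamma = [\gamma_1, \ldots, \gamma_{m'}] = \bar 1 (TD_c)^{*}$;
then $\gcd(\gamma_1, \ldots, \gamma_{m'}) = \gcd(\pi_1, \ldots, \pi_n) = 1$. Let
$(TD_c)_{ij}$, $i, j = 1, \ldots, m'$, be the cofactors of matrix $(TD_c)$.
Then by Fact 1,
$\Gamma = [\sum_{j=1}^{m'} (TD_c)_{1j}, \ldots, \sum_{j=1}^{m'} (TD_c)_{m'j}]$,
i.e., $\gamma_i = \sum_{j=1}^{m'} (TD_c)_{ij}$, $i = 1, \ldots, m'$. So,
$\gcd(\sum_{j} (TD_c)_{1j}, \ldots, \sum_{j} (TD_c)_{m'j}) = \gcd(\gamma_1, \ldots, \gamma_{m'}) = 1$,
which means $\gcd((TD_c)_{ij},\ i, j = 1, \ldots, m') = 1$.
This implies that the greatest common divisor of all
subdeterminants of order $m' - 1$ is equal to 1. By Theorem
5.1, $\prod_{i=1}^{m'-1} s_i = 1$. Therefore, $s_1 = \cdots = s_{m'-1} = 1$. □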


List of Symbols

$D$: dependence matrix with $n$ rows and $m$ columns; see Definition 2.1 4).

$D_c$: determining matrix; see Definition 3.1.

$\bar d_i$: dependence (column) vector with $n$ components; see Definition 2.1 4).

$\det(A)$: determinant of matrix $A$.

$\operatorname{disp}\Pi$: a positive integer; see Definition 3.1.

$E$: the set of edges of the algorithm dependence graph; see Definition 2.2.

$E_i$: row vector with $n$ components whose entries are zero except that the $i$th entry is one.

$I$: identity matrix.

$R$: set of real numbers.

$J$: index set; see Definition 2.1 1).

$J_i$: a block of a partition; see Definitions 23, 23 and 3.2.

$J_{\bar y}$: a block of a partition; see Definitions 3.4, 3.5 and 5.2.

$\bar j$: index point (column vector); see Definition 2.1 1).

$m$: number of dependence vectors in $D$; see Definition 2.1 4).

$m'$: the rank of the dependence matrix $D$; see Definition 2.1 4).

$N$: set of nonnegative integers.

$N^+$: set of positive integers.

$n$: number of components of index points in $J$; see Definition 2.1 1).

$P$: a pseudo-independent partition; see Definition 23.

$P_{max}$: maximal pseudo-independent partition; see Definition 2.5.

$P_\alpha$: $\alpha$-partition; see Definition 3.2.

$P_\Psi$: $\Psi$-partition; see Definition 3.4.

$P_{\alpha\Psi}$: $\alpha\Psi$-partition; see Definition 3.5.

$P_U$: $U$-partition; see Definition 5.2.

$\mathcal{P}$: independent partition; see Definition 2.3.

$\operatorname{rank}(A)$: rank of matrix $A$.

$S$: the Smith normal form for dependence matrix $D$; see Theorem 5.1.

$\bar s$: displacement vector (column); see Definition 5.1.

$sp\{\bar v_1, \ldots, \bar v_k\}$: the vector space spanned by vectors $\bar v_1, \ldots, \bar v_k$.

SNF: abbreviation of Smith Normal Form.

$U$: left multiplier of the Smith normal form; see Theorem 5.1.

$V$: right multiplier of the Smith normal form; see Theorem 5.1.

$\bar x$: a real column vector.

$\bar y$: index vector of blocks of a partition; see Definitions 3.4, 3.5, and 5.2.

$Z$: set of integers.

$\alpha$: algorithm coefficient (positive integer); see Definition 3.1.

$\rho$: number of blocks in the $U$-partition; see Definition 5.2.

$\nu$: number of blocks in the $\alpha\Psi$-partition; see Definition 3.5.

$\Pi$: partitioning vector (row); see Definition 3.1.

$\Psi$: separating matrix; see Definition 3.3.




IEEE TRANSACTIONS ON COMPUTERS, VOL. 41, NO. 2, FEBRUARY 1992


$\Psi_i$: separating vector (row); see Definition 3.3.

V:  number  of  blocks  in  the  ^-partition;  see  Definition  3.4. 

$\emptyset$: empty set.

1:  a  column  or  row  vector  whose  entries  are  all  1. 

0:  a  column  or  row  vector  whose  entries  are  all  0. 

|P|:  cardinality  of  set  P. 

|c|:  the  absolute  value  of  scalar  c. 

$A - B$: set $\{x : x \in A,\ x \notin B\}$, where $A$ and $B$ are sets.

$a|b$: $a$ divides $b$; $a, b \in Z$.

$b\ (\mathrm{mod}\ a)$: modulo operation, i.e., $b\ (\mathrm{mod}\ a) = d$ iff $a|(b - d)$ where $0 \le d < a$.


Acknowledgment 

The  authors  thank  S.  H.  Zak,  M.  O’Keefe,  and  V.  Van 
Dongen  for  their  discussions  on  Section  V.  Also,  J.-K.  Peir 
provided  valuable  insights  on  the  MDM  and  explained  how  it 
could  be  used  to  partition  the  algorithm  of  Example  5.2  in  [22]. 

References 

[1] U. Banerjee, S. C. Chen, D. J. Kuck, and R. A. Towle, "Time and parallel
processor bounds for FORTRAN-like loops," IEEE Trans. Comput., vol.
C-28, pp. 660-670, Sept. 1979.

[2] R. Cytron, "Doacross: Beyond vectorization for multiprocessors," in
Proc. 1986 Int. Conf. Parallel Processing, pp. 836-844.

[3] J. A. B. Fortes, "Algorithm transformations for parallel processing and
VLSI architecture design," Ph.D. dissertation, Dep. Elec. Eng.-Syst.,
Univ. of Southern California, Dec. 1983.

[4] D. D. Gajski and J.-K. Peir, "Essential issues in multiprocessor systems,"
IEEE Comput. Mag., vol. 18, pp. 9-27, June 1985.

[5] K. Hwang and F. A. Briggs, Computer Architecture and Parallel
Processing. New York: McGraw-Hill, 1984.

[6] R. Kannan and A. Bachem, "Polynomial algorithms for computing
the Smith and Hermite normal forms of an integer matrix," SIAM J.
Comput., vol. 8, no. 4, pp. 499-507, Nov. 1979.

[7] R. M. Karp, R. E. Miller, and S. Winograd, "The organization of
computations for uniform recurrence equations," J. ACM, vol. 14, no. 3,
pp. 563-590, July 1967.

[8] D. J. Kuck, A. H. Sameh, R. Cytron, A. V. Veidenbaum, C. D.
Polychronopoulos, G. Lee, T. McDaniel, B. R. Leasure, C. Beckman, J. R. B.
Davies, and C. Kruskal, "The effects of program restructuring, algorithm
changes, and architecture choice on program performance," in Proc.
1984 Int. Conf. Parallel Processing, pp. 129-138.

[9] D. Kuck, E. Davidson, D. Lawrie, and A. H. Sameh, "Parallel
supercomputing today and the Cedar approach," Science, vol. 231, pp. 967-974,
Feb. 28, 1986.

[10] D. I. Moldovan and J. A. B. Fortes, "Partitioning and mapping
algorithms into fixed size systolic arrays," IEEE Trans. Comput., vol. C-35,
pp. 1-12, Jan. 1986.

[11] D. I. Moldovan, "On the design of algorithms for VLSI systolic arrays,"
Proc. IEEE, vol. 71, pp. 113-120, Jan. 1983.

[12] L. Mordell, Diophantine Equations. New York: Academic, 1969, p. 30.

[13] D. A. Padua, "Multiprocessors: Discussion of theoretical and
practical problems," Ph.D. dissertation, Rep. UIUCDCS-R-79-990, Univ. of
Illinois at Urbana-Champaign, Urbana, IL, Nov. 1979.

[14] D. A. Padua, D. J. Kuck, and D. H. Lawrie, "High speed multiprocessor
and compilation techniques," IEEE Trans. Comput., vol. C-29, pp.
763-776, Sept. 1980.

[15] J.-K. Peir and R. Cytron, "Minimum distance: A method for partitioning
recurrences for multiprocessors," in Proc. 1987 Int. Conf. Parallel
Processing, pp. 217-225.

[16] J.-K. Peir and D. D. Gajski, "CAMP: A programming aid for
multiprocessors," in Proc. 1986 Int. Conf. Parallel Processing, pp. 475-482.

[17] J.-K. Peir, "Program partitioning and synchronization on multiprocessor
systems," Ph.D. dissertation, Rep. UIUCDCS-R-86-1259, Dep. Comput.
Sci., Univ. of Illinois at Urbana-Champaign, Urbana, IL, Mar. 1986.

[18] C. D. Polychronopoulos, D. J. Kuck, and D. A. Padua, "Execution of
parallel loops on parallel processor systems," in Proc. 1986 Int. Conf.
Parallel Processing, pp. 519-527.

[19] A. Schrijver, Theory of Linear and Integer Programming. New York:
Wiley, 1986.

[20] W. Shang and J. A. B. Fortes, "Time optimal linear schedules for
algorithms with uniform dependencies," in Proc. Int. Conf. Systolic
Arrays, May 1988, pp. 393-402.

[21] ---, "Partitioning of uniform dependency algorithms for parallel
execution on MIMD/systolic systems," Tech. Rep. TR-EE 88-18, School
of Electrical Engineering, Purdue Univ., W. Lafayette, IN 47907, Apr.
1988.

[22] ---, "Independent partitioning of algorithms with uniform
dependencies," in Proc. 1988 Int. Conf. Parallel Processing, Vol. 2 Software, pp.
26-33.

[23] G. Strang, Linear Algebra and Its Applications, second ed. New York:
Academic, 1980.

[24] O. Veblen and P. Franklin, "On matrices whose elements are integers,"
Ann. of Mathematics, vol. 23 (1921-2), pp. 1-15.

[25] M. J. Wolfe, "Optimizing Supercompilers for Supercomputers," Ph.D.
dissertation, Rep. UIUCDCS-R-82-1105, Univ. of Illinois at
Urbana-Champaign, Urbana, IL, 1982.


Weijia Shang (S'88-M'90) received the B.S. degree
in computer engineering and science from
Changsha Institute of Technology in 1982 and the
M.S. and Ph.D. degrees in electrical engineering
from Purdue University, West Lafayette, IN, in 1984
and 1990, respectively.

She is currently an Assistant Professor in the Center
for Advanced Computer Studies, the University
of Southwestern Louisiana, Lafayette. Her research
interests include parallel processing, computer
architecture, algorithm transformation, processor array
programming, special purpose VLSI bit-level processor array design, and
optimizing compiler technique.

Dr. Shang is a member of the Association for Computing Machinery.


Jose A. B. Fortes (S'80-M'83) received the
Licenciatura em Engenharia Electrotécnica from the
Universidade de Angola in 1978, the M.S. degree
in electrical engineering from Colorado State
University, Fort Collins, in 1981, and the Ph.D. degree
in electrical engineering from the University of
Southern California, Los Angeles, in 1984.

In 1984, he joined the faculty of the School
of Electrical Engineering, Purdue University, West
Lafayette, IN, where he currently is an Associate
Professor. From July 1989 through July 1990 he
served at the National Science Foundation as program director for
microelectronics systems architecture. His research interests are in the areas of
parallel processing, fault-tolerant computing, and VLSI computer architecture,
on which he has coauthored over 50 papers. His research has been
funded by the Office of Naval Research, AT&T Foundation, General Electric,
and the National Science Foundation.

Dr. Fortes is a member of the IEEE Professional Society. He is on the
Editorial Boards of the Journal of Parallel and Distributed Computing and
the Journal of VLSI Signal Processing.


REFERENCE  NO.  5 


Rau,  D.,  Fortes,  J.  A.  B.  and  Siegel,  H.  J.,  "Destination  Tag  Routing  Techniques 
Based  on  a  State  Model  for  the  IADM  Network,"  IEEE  Transactions  on  Computers, 
Volume  41,  Number  3,  March  1992,  pp.  274-286. 

Note - This paper proposes a simple destination-tag routing mechanism for a class of
networks. It also shows how the message routing scheme supports the rerouting of
messages around faulty nodes and links. These results are of use in fault-tolerant
processor arrays where a multistage network is used to connect processors.




IEEE TRANSACTIONS ON COMPUTERS, VOL. 41, NO. 3, MARCH 1992


Destination  Tag  Routing  Techniques  Based 
on  a  State  Model  for  the  IADM  Network 

Darwen  Rau,  Jose  A.  B.  Fortes,  Member,  IEEE,  and  Howard  Jay  Siegel,  Fellow,  IEEE 


Abstract — A  “state  model”  is  proposed  for  solving  the  problem 
of  routing  and  rerouting  messages  in  the  Inverse  Augmented 
Data  Manipulator  (IADM)  network.  Using  this  model,  necessary 
and  sufficient  conditions  for  the  reroutability  of  messages  are 
established,  and  then  destination  tag  schemes  are  derived.  These 
schemes are simpler, more efficient, and require less complex
hardware than previously proposed routing schemes. Two destination
tag schemes are proposed. For one of the schemes,
rerouting is totally transparent to the sender of the message and
any blocked link of a given type can be avoided. Compared with
previous works that deal with the same type of blockage, the
time x space complexity is reduced from O(log N) to O(1).
For the other scheme, rerouting is possible for any type of link
blockage. A universal rerouting algorithm is constructed based
on the second scheme, which finds a blockage-free path for any
combination of multiple blockages if there exists such a path, and
indicates absence of such a path if there exists none. In addition,
the state model is used to derive constructively a lower bound on
the number of subgraphs which are isomorphic to the Indirect
Binary n-Cube network in the IADM network. This knowledge
can be used to characterize properties of the IADM networks and
for permutation routing in the IADM networks.

Index  Terms — Cube  network,  data  manipulator  network, 
destination-tag routing, fault tolerance, interconnection network,
multiprocessor,  parallel  processing,  state  model. 


I.  Introduction 

THIS  paper  discusses  novel  and  efficient  techniques  for 
routing  and  rerouting  messages  in  the  Inverse  Augmented 
Data  Manipulator  (IADM)  network  [9].  These  results  are  based 
on a new approach, the “state model,” which characterizes and
correlates  the  topologies  of  the  IADM  and  Indirect  Binary 
n-Cube  networks,  and  leads  to  efficient  exploitation  of  the 
redundancy  available  in  the  IADM  network. 

Considerable  research  has  been  dedicated  to  the  design  of 
multistage interconnection networks for multiprocessor systems.
The class of data manipulator networks, introduced in
[3], includes, among others, the Augmented Data Manipulator
(ADM) network [17], the IADM network [9], and the Gamma
network [13], [14]. The IADM network and the ADM network

Manuscript received October 15, 1987; revised August 12, 1991. This work
was supported in part by the National Science Foundation under Grant
DCI-8419745, by the Innovative Science and Technology Office of the Strategic
Defense Initiative Organization and was administered through the Office of
Naval Research under Contract N00014-88-k-0723, and by the Supercomputing
Research Center under Contract MDA904-85-C-5027.

D. Rau was with the School of Electrical Engineering, Purdue University,
West Lafayette, IN 47907. He is now with AT&T Bell Laboratories,
Naperville, IL 60566.

J. A. B. Fortes and H. J. Siegel are with the Parallel Processing Laboratory,
School of Electrical Engineering, Purdue University, West Lafayette, IN
47907.

IEEE Log Number 9105560.


differ only in that the input side of one of them corresponds to
the output side of the other and vice versa. The Gamma and the
IADM networks are topologically equivalent; however, they
use switches of different types. Each 3 x 3 crossbar switch
used in the Gamma network can connect simultaneously all
three inputs to all three outputs, whereas each switch used in
the IADM network can connect only one of its three inputs to
one or more of its three outputs. The main interest of this paper
is the study of the IADM network; both the one-to-one and
permutation routings are considered. The schemes proposed
for routing and rerouting messages in the IADM network are
also applicable to the Gamma network.

Perhaps the most popular class of multistage networks is
the multistage cube-type networks such as the Indirect Binary
n-Cube [15], Omega [6], Baseline [20], Generalized Cube
[18], STARAN flip [2], and a special case of SW-Banyan
[4] networks. Among the main advantages of these networks
are their very efficient destination tag routing schemes,
partitionability, $O(N \log_2 N)$ cost, and ability to pass useful
permutations [16]. Some results of this paper are based on
characteristics of the Indirect Binary n-Cube network (hereon
referred to as the ICube network). Since the cube-type networks
mentioned above are all topologically equivalent [16],
[17], [20], [21], the results in this paper are also relevant to
any of them.

The ICube network is composed of $n = \log_2 N$ stages
labeled from 0 to $n - 1$. Each stage consists of $N$ connection
links and $N/2$ interchange (switch) boxes. The structure of
the network is such that two input links of an interchange box
differ only in the $i$th bit of their labels; the upper links have
a “0” in the $i$th bit and the lower links have a “1.” Fig. 1
illustrates an ICube network of size $N = 8$ and two possible
states of an interchange box, “straight” and “exchange.” Since
this paper considers only one-to-one and permutation routing,
broadcast states are not shown.

The IADM network is composed of $n$ stages labeled from
0 to $n - 1$. Each stage consists of $3N$ connection links and $N$
switching elements. An extra column of switches is appended
at the end of the last stage as the output switches and is referred
to as stage $n$. Each switch $j$ at stage $i$ has three output links
to switches $(j - 2^i) \bmod N$, $j$, and $(j + 2^i) \bmod N$ of the
succeeding stage. Each switch selects one of its input links
and connects it to one or more output links. Fig. 2 illustrates
an IADM network of size $N = 8$.
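
The interconnection rule above can be written down directly. The following is a minimal sketch under stated assumptions, not code from the paper: the function names `iadm_successors` and `stage_links` are hypothetical, and switches are identified by integers 0 to N - 1 with N = 2^n.

```python
def iadm_successors(j, i, n):
    """Switches of stage i+1 reached from switch j of stage i in an
    IADM network with N = 2**n switches per stage.

    The three output links are returned in the order -2^i, straight, +2^i.
    """
    N = 1 << n
    step = 1 << i
    return ((j - step) % N, j, (j + step) % N)


def stage_links(i, n):
    """All (source switch, target switch) links leaving stage i."""
    N = 1 << n
    return [(j, t) for j in range(N) for t in iadm_successors(j, i, n)]


# For N = 8 (n = 3), every stage has 3N = 24 links.
for stage in range(3):
    assert len(stage_links(stage, 3)) == 24

# At the last stage the -2^(n-1) and +2^(n-1) links of a switch reach the
# same switch of the next stage, although they are distinct links.
minus, straight, plus = iadm_successors(6, 2, 3)
assert minus == plus
```

The last check reflects that 2^(n-1) and -2^(n-1) are congruent modulo N, so the two nonstraight links of a stage n - 1 switch end at the same output switch.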

In  a  multistage  interconnection  network,  the  path  connecting 
the  source  of  a  message  to  its  destination  is  determined  by 
a  routing  scheme  that  specifies  the  switching  state  of  each 


0018-9340/92$03.00 © 1992 IEEE


RAU et al.: DESTINATION TAG ROUTING TECHNIQUES BASED ON STATE MODEL FOR IADM NETWORK




Fig. 1. The Indirect Binary n-Cube (ICube) network for N = 8 (according
to the first graph model); two possible states for each box are shown (i.e.,
straight and exchange).


Fig. 2. The IADM network for N = 8 (according to the first graph model);
even_i and odd_i switches, 0 <= i <= 2, are enclosed with bold and regular
edges, respectively. The solid edges (links) show the ICube subgraph.


switch in the path. Routing schemes are considerably simpler
for the cube-type networks than for the data manipulator-type
networks. In cube-type networks, the interchange box at stage $i$
needs to examine the $i$th bit of the binary representation of the
destination address of an incoming message. If the $i$th bit is 0,
then the upper output of the box is taken. If the $i$th bit is 1, the
lower output of the box is taken. These schemes are known as
destination tag routing schemes [6] and are extremely efficient
and simple to implement. Unlike cube-type networks, in the
IADM and other data manipulator-type networks there are
several paths between any source $s$ and destination $d$ ($s \ne d$)
and each switching element has at least three switching states.
Previously proposed routing schemes [9], [10], [13] for the
IADM network can be thought of as distance tag schemes;
that is, they require calculation of the distance from source
to destination in order to generate routing and rerouting tags.
The rerouting schemes in these works essentially find an
alternate representation of the distance, which specifies an
alternate routing path.
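
The destination tag rule for cube-type networks described above can be sketched in a few lines. This is an illustrative sketch only: the function name `icube_route` is hypothetical, and it assumes that bit 0 is the least significant bit of a link label and that taking the upper (lower) output of a stage $i$ box sets bit $i$ of the current label to 0 (1).

```python
def icube_route(source, dest, n):
    """Link labels visited by a message routed from source to dest
    through an n-stage cube-type network using destination tags.

    At stage i only bit i of the destination address is examined:
    0 selects the upper output, 1 selects the lower output.
    """
    label = source
    path = [label]
    for i in range(n):
        bit = (dest >> i) & 1
        # Replace bit i of the current label with bit i of the destination.
        label = (label & ~(1 << i)) | (bit << i)
        path.append(label)
    return path


# The message always arrives at dest, whatever the source is.
for s in range(8):
    for d in range(8):
        assert icube_route(s, d, 3)[-1] == d
```

No distance between source and destination is computed here, which is the contrast with the distance tag schemes mentioned in the text.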

McMillen and Siegel [9] proposed three dynamic rerouting
techniques for the IADM network for avoiding faulty or
blocked $\pm 2^i$ (nonstraight) links. The first and the second
schemes require that switches be capable of performing two's
complement and $\pm 2^i$ addition operations, respectively. The
third scheme requires one extra tag bit which is dynamically
updated as the message propagates toward the destination.
In [10], the work of [9] was expanded, and a single-stage
look-ahead scheme was proposed to avoid certain types of
straight link faults. This improved scheme also requires two's
complement operations.

Parker and Raghavendra [13] used redundant number
representation and proposed an algorithm capable of finding all
routing paths, which, effectively, are the redundant number
representations for the distance between the source and the
destination. Because of the complexity of the algorithm, the
cost of computation is so large that it is infeasible
to implement the algorithm in order to achieve dynamic
routing [19]. In addition, although the algorithm can generate
all routing tags for any distance, there is no specific work on
rerouting schemes in [13], [14].
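
To make the notion of redundant number representations of the distance concrete, the sketch below enumerates them by exhaustive search. It is not the algorithm of [13]; the function name `signed_digit_tags` is hypothetical, and the search over all $3^n$ digit strings is chosen only for clarity. Digit $i$ of a tag is taken to mean the link used at stage $i$: -1 for the $-2^i$ link, 0 for the straight link, and +1 for the $+2^i$ link.

```python
from itertools import product


def signed_digit_tags(s, d, n):
    """All digit strings (c_0, ..., c_{n-1}), c_i in {-1, 0, 1}, with
    sum(c_i * 2**i) congruent to (d - s) modulo N = 2**n.

    Each string describes one path from source s to destination d in an
    IADM network with n stages.
    """
    N = 1 << n
    distance = (d - s) % N
    tags = []
    for digits in product((-1, 0, 1), repeat=n):
        total = sum(c * (1 << i) for i, c in enumerate(digits))
        if total % N == distance:
            tags.append(digits)
    return tags


# With s equal to d only the all-straight path exists for N = 8.
assert signed_digit_tags(0, 0, 3) == [(0, 0, 0)]

# For a distance of 1 there are several alternatives, e.g. +1 at stage 0,
# or -1 at stage 0 followed by +2 at stage 1.
tags = signed_digit_tags(0, 1, 3)
assert (1, 0, 0) in tags and (-1, 1, 0) in tags
```

A dynamic routing scheme cannot afford such a search in every switch, which is the practical objection raised in the text.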

Lee  and  Lee  [7]  proposed  signed  bit  difference  tag  and 
destination  tag  local  control  algorithms  for  the  ADM  and 
IADM  networks  that  require  no  computation  for  the  distance 
between  the  source  and  the  destination.  But  their  local  control 
algorithms  can  only  find  one  routing  path  for  each  source  and 
destination  pair.  If  the  need  for  rerouting  arises,  they  still  resort 
to  the  distance  tag  schemes  to  find  alternate  paths. 

Past research has shown interesting relationships between
data manipulator and cube-type networks. For example,
because it is possible to embed the Generalized Cube network
in the ADM network [1], [17], the set of interconnections
implementable by the ADM network is a superset of that of
the Generalized Cube network. This fact and the existence of
multiple paths between any source $s$ and destination $d$ ($s \ne d$)
in the ADM network suggest that the ADM network can
be thought of as a fault-tolerant Generalized Cube network.
Analogously, the IADM network can be regarded as a
fault-tolerant ICube network.$^1$ Since the permutations realizable
by cube-type networks are well studied, the identification
of possible embeddings of the ICube network in the IADM
network can help characterize the permutation capabilities
of this network. A contribution to the precise understanding
of these notions is made in this paper: it consists of the
identification of a large number of distinct subgraphs of the
IADM network that are isomorphic to the ICube network.

Section II of this paper introduces a state model to describe
and correlate topologies of the ICube network and the
IADM network. Necessary and sufficient conditions to perform
rerouting in the IADM network are derived in Section III. In
Section IV, two routing and rerouting schemes are proposed

$^1$ While topologically equivalent, the ICube and Generalized Cube I/O ports
are addressed so that their interrelationship is the same as that of the IADM
and ADM network, i.e., the input and output sides are interchanged.




IEEE TRANSACTIONS ON COMPUTERS, VOL. 41, NO. 3, MARCH 1992


based on the theory developed in Section III, together with a
discussion of their merits and implementation considerations.
A universal rerouting algorithm is proposed in Section V,
which can deal with any combination of multiple link
blockages. A class of subgraphs in the IADM network that are
isomorphic to the ICube network is identified in Section VI,
and it is shown how to reconfigure the IADM network under
certain link faults to pass the cube-admissible permutations.
Finally, Section VII summarizes the results presented in this
paper.

II. State Model Description for the ICube
and IADM Networks

Multistage networks can be modeled as graphs by treating
interchange boxes (also called switching elements) and links
of the network as nodes and edges of the graph, respectively.
Another equivalent graph model [1], [8] results if interchange
boxes are associated with edges, and links with nodes. Both
models are exemplified in Figs. 1 and 3 for the ICube network.
The IADM network is shown in Fig. 2 according to the
first model. The design of switches based on both models is
discussed in [11]. Clearly, the ICube network in Fig. 3 can
be regarded as being a subgraph of the IADM network in
Fig. 2. Henceforth, the second model is always assumed when
referring to the ICube network (i.e., Fig. 3) and the first model
is assumed when dealing with the IADM network.

With respect to these graph models, the nodes and the
edges of the graph refer to the switches and the links of the
networks, respectively. The number of switches at each stage
of a network is denoted $N$, and $n = \log_2 N$ refers to the
number of stages. The switches of each stage are labeled from
$0$ to $N - 1$ from the top to the bottom. Any integer $j$ has
a binary representation $j_0 j_1 \cdots j_{n-1}$, where $j_{n-1}$ is the most
significant bit and $n$ denotes the number of bits. The notation
$j_{r/s}$ means the bits of $j$ starting at $j_r$ and ending at $j_s$, where
$r \le s$. Bit $\bar{j}_i$ is the 1's complement of bit $j_i$. Throughout this
paper, $j$ and $j + a$, where $a$ is some constant, are reserved
to represent labels of switches. Also, modulo $N$ arithmetic is
assumed, e.g., $j + a$ implies $(j + a) \bmod N$. The notation
$j \in S_i$ is used to indicate that a switch $j$ belongs to stage $i$,
and $(j' \in S_i, j'' \in S_{i+1})$ is used to represent a link at stage
$i$ joining $j' \in S_i$ and $j'' \in S_{i+1}$. A sequence of switches
of contiguous stages $(j' \in S_i, j'' \in S_{i+1}, \ldots, j''' \in S_{i+k})$ is
used to represent a path from $j' \in S_i$ to $j''' \in S_{i+k}$.

Notation and terminology required for the characterization
of network topologies and destination tag routing schemes are
introduced next. A switch $j$ of stage $i$ is an even$_i$ switch if
$j_i = 0$ and an odd$_i$ switch if $j_i = 1$. Fig. 2 identifies even$_i$
and odd$_i$ switches at different stages of the IADM network of
size $N = 8$. Let $t_i$ represent a tag bit and define the functions
$\Delta C_i$ and $\Delta\bar{C}_i$ that represent connection links at stage $i$ as

$$
\Delta C_i(j, t_i) =
\begin{cases}
0 & \text{if } j \text{ is an even}_i \text{ switch and } t_i = 0 \\
0 & \text{if } j \text{ is an odd}_i \text{ switch and } t_i = 1 \\
-2^i & \text{if } j \text{ is an odd}_i \text{ switch and } t_i = 0 \\
+2^i & \text{if } j \text{ is an even}_i \text{ switch and } t_i = 1
\end{cases}
\;=\; -\Delta\bar{C}_i(j, t_i).
$$


Fig. 3. The Indirect Binary n-Cube (ICube) network for $N = 8$ (according
to the second graph model).

Also, define the functions $C_i(j, t_i) = j + \Delta C_i(j, t_i)$ and
$\bar{C}_i(j, t_i) = j + \Delta\bar{C}_i(j, t_i)$. These definitions imply the
following lemma of fundamental importance to the results of
this paper.

Lemma 2.1:

$$C_i(j, t_i) = j_{0/i-1}\, t_i\, j_{i+1/n-1}$$
$$\bar{C}_i(j, t_i) = j_{0/i-1}\, t_i\, q_{i+1/n-1}$$

for some value of $q_{i+1/n-1}$ which depends on $j$ and $t_i$.

Proof: If $j$ is an even$_i$ switch and $t_i = 0$, then $C_i(j, t_i) =
\bar{C}_i(j, t_i) = j$. If $j$ is an odd$_i$ switch and $t_i = 1$, then
$C_i(j, t_i) = \bar{C}_i(j, t_i) = j$. If $j$ is an odd$_i$ switch and $t_i = 0$, then
$C_i(j, t_i)$ results from subtracting 1 from $j_i$. Since $j$ is an odd$_i$
switch, $j_i = 1$, no borrow is generated and all remaining bits
of $j$ are unchanged; however, $\bar{C}_i(j, t_i)$ adds 1 to $j_i$, changing
the $i$th bit to 0 and altering some of the bits in positions
$i + 1, \ldots, n - 1$ due to carry propagation. Similar reasoning
applies when $j$ is an even$_i$ switch and $t_i = 1$. □
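
The lemma can also be checked exhaustively for small networks. The following Python sketch is an illustration only and is not part of the original paper: the names `delta_c`, `c`, `c_bar`, and `check_lemma_2_1` are hypothetical, and switch labels are assumed to be plain integers whose bit $j_i$ is obtained as `(j >> i) & 1`.

```python
def delta_c(j: int, i: int, t: int) -> int:
    """Offset of the link selected in state C at stage i for tag bit t."""
    j_i = (j >> i) & 1            # bit i of the switch label (j_0 is the LSB)
    if j_i == t:                  # even_i with t = 0, or odd_i with t = 1
        return 0                  # straight link
    return (1 << i) if j_i == 0 else -(1 << i)


def c(j: int, i: int, t: int, n: int) -> int:
    """C_i(j, t) = j + delta_c(j, i, t), modulo N = 2**n."""
    return (j + delta_c(j, i, t)) % (1 << n)


def c_bar(j: int, i: int, t: int, n: int) -> int:
    """C-bar_i(j, t) = j - delta_c(j, i, t), modulo N = 2**n."""
    return (j - delta_c(j, i, t)) % (1 << n)


def check_lemma_2_1(n: int) -> bool:
    """Exhaustively test both statements of Lemma 2.1 for N = 2**n."""
    for i in range(n):
        low_bits = (1 << (i + 1)) - 1             # mask for bits 0..i
        for j in range(1 << n):
            for t in (0, 1):
                expected = (j & ~(1 << i)) | (t << i)   # j with bit i set to t
                if c(j, i, t, n) != expected:
                    return False
                # C-bar must agree with the expected label on bits 0..i only.
                if (c_bar(j, i, t, n) ^ expected) & low_bits:
                    return False
    return True


if __name__ == "__main__":
    print(check_lemma_2_1(3), check_lemma_2_1(5))
```

The check on `c_bar` deliberately ignores bits above position $i$, since the lemma leaves those bits unspecified.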

The notation and terminology just introduced can now be
used to describe the networks of interest in this paper. The
following description for a network in terms of $\Delta C_i$, $\Delta\bar{C}_i$,
$C_i$, and $\bar{C}_i$ is called the network state model.
The ICube network is composed of $n$ stages labeled from
$0$ to $n - 1$. Each stage consists of $2N$ links and $N$ switches.
An extra column of switches is appended at the end of the
last stage as the output switches (Fig. 3) and is denoted $S_n$.
A switch $j \in S_i$ is connected to switches $C_i(j, t_i) \in S_{i+1}$,
for $0 \le i \le n - 1$, $0 \le j \le N - 1$, and $t_i = 0$ or $t_i = 1$.
When using destination tags, switch $j \in S_i$ routes a message to
switch $C_i(j, d_i) \in S_{i+1}$, where $d_i$ is the $i$th bit of the address
of the message destination.

The IADM network is composed of $n$ stages labeled from $0$
to $n - 1$. Each stage consists of a column of $N$ switches and $3N$
connection links. An extra column of switches is appended at
the end of the last stage as the output switches and is denoted
$S_n$. A switch $j \in S_i$ is connected to switches $C_i(j, t_i) \in S_{i+1}$
and $\bar{C}_i(j, t_i) \in S_{i+1}$ for $0 \le i \le n - 1$, $0 \le j \le N - 1$,
and $t_i = 0$ or $t_i = 1$. In other words, three links connect
a switch $j \in S_i$ to the switches $(j - 2^i)$, $j$, and $(j + 2^i)$ at
stage $i + 1$. Sometimes $+2^i$ and $-2^i$ are used to represent
links $(j \in S_i, (j + 2^i) \in S_{i+1})$ and $(j \in S_i, (j - 2^i) \in
S_{i+1})$, respectively. The term a straight link refers to link


RAU et al.: DESTINATION TAG ROUTING TECHNIQUES BASED ON STATE MODEL FOR IADM NETWORK




Fig. 4. The connection links of stage $i$ of the ICube network can be described by the function $\Delta C_i$. The connection links of stage $i$ of the IADM
network can be described by the union of the functions $\Delta C_i$ and $\Delta\bar{C}_i$.


$(j \in S_i, j \in S_{i+1})$ and the term a nonstraight link refers to
links $\pm 2^i$.

According to the model, two types of switches, even$_i$ and
odd$_i$, are required in the IADM and ICube networks. Fig. 4
illustrates the connection links of a pair of even$_i$ and odd$_i$
switches for an ICube and IADM network of size $N = 8$. The
$\Delta C_i$ function describes the ICube connections. For the IADM
network, the connection links can be described by the union
of the functions $\Delta C_i$ and $\Delta\bar{C}_i$. In practice, even$_i$ and odd$_i$
switches can be identical and easily programmed (at power-up
or system configuration time) to behave differently.

There are two possible routing behaviors (or states) for each
switch in an IADM network. A switch is said to be in state $C$ if
the routing is decided in accordance with the function $C_i(j, t_i)$,
and it is in state $\bar{C}$ if the function $\bar{C}_i(j, t_i)$ applies. On
the whole, the link on which a message is routed depends on
whether the switch is an even$_i$ or odd$_i$ switch, whether it is in
state $C$ or $\bar{C}$, and the value of tag bit $t_i$. Also, the term state
of the network is used to denote collectively the states of all
switches in the network.

The  notion  of  switch  state  is  only  conceptual;  it  can  be 
implemented  by  designing  the  switches  with  actual  logic  states 
as  well  as  by  using  tags  with  n  added  bits  specifying  the  states 
of  the  switches  on  the  routing  path.  In  Section  IV,  these  and 
other  aspects  of  the  actual  implementation  of  the  proposed 
schemes  are  discussed  in  detail. 


III. Theory Behind the State-Based Destination
Tag Routing Schemes

Based on the framework developed in Section II, routing
problems in the IADM network are now examined. It is
clear that when every switch in the IADM network is in the
state $C$, the IADM network behaves like an ICube network
and, therefore, the destination address $d_{0/n-1}$ can be used
as a routing tag, i.e., $t_i = d_i$. More generally, the following
theorem can be proven.

Theorem 3.1: Let $d = d_{0/n-1}$ be the destination in the
IADM network to which a message is to be sent. Then $t =
d_{0/n-1}$ is the unique destination routing tag to the destination
$d$ regardless of the state of the IADM network.

Proof: Consider an arbitrary tag $f_{0/n-1}$ and assume that
the IADM network is in an arbitrary state. Let $t_{0/n-1} =
f_{0/n-1}$. Then each switch will route the incoming message
to either $C_i(j, f_i)$ or $\bar{C}_i(j, f_i)$. From Lemma 2.1, it can
be reasoned by induction that, at stage $i$, $(C_i(j, f_i))_{0/i} =
(\bar{C}_i(j, f_i))_{0/i} = f_{0/i}$; at the last stage, $C_{n-1}(j, f_{n-1}) =
\bar{C}_{n-1}(j, f_{n-1}) = f_{0/n-1}$. Thus, the address of the destination
of the message is the same as the routing tag. This proves both
the validity and the uniqueness of $d_{0/n-1}$ as a routing tag.

□
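
Theorem 3.1 lends itself to a quick empirical check. The sketch below is an illustration under stated assumptions rather than the authors' procedure: it draws random network states (the `states[i][j]` table, with 0 standing for state $C$ and 1 for state $\bar{C}$, is a hypothetical representation) and confirms that the destination tag always delivers the message to $d$. The names `step`, `route`, and `theorem_3_1_holds` are introduced here for illustration.

```python
import random


def step(j: int, i: int, t: int, state: int, n: int) -> int:
    """Next switch reached from j at stage i with tag bit t.

    state = 0 models state C, state = 1 models state C-bar.
    """
    j_i = (j >> i) & 1
    if j_i == t:
        return j                                  # straight link in either state
    offset = (1 << i) if j_i == 0 else -(1 << i)  # link chosen in state C
    if state:
        offset = -offset                          # oppositely signed link
    return (j + offset) % (1 << n)


def route(s: int, d: int, states, n: int) -> list:
    """Switch labels visited from s in S_0 to the output column S_n."""
    path = [s]
    for i in range(n):
        j = path[-1]
        path.append(step(j, i, (d >> i) & 1, states[i][j], n))
    return path


def theorem_3_1_holds(n: int, trials: int = 50, seed: int = 0) -> bool:
    """Check that tag d reaches d for every (s, d) under random states."""
    rng = random.Random(seed)
    size = 1 << n
    for _ in range(trials):
        states = [[rng.randint(0, 1) for _ in range(size)] for _ in range(n)]
        for s in range(size):
            for d in range(size):
                if route(s, d, states, n)[-1] != d:
                    return False
    return True


if __name__ == "__main__":
    print(theorem_3_1_holds(3))
    print(route(1, 0, [[0] * 8 for _ in range(3)], 3))
```

A randomized test of this kind does not replace the inductive proof; it only illustrates that the path, and not the destination, depends on the network state.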

It is implicit in the reasoning underlying Theorem 3.1 that
any link on a given path results from the appropriate choice
of the state of the corresponding switch, i.e., the use of “link”
$\Delta C_i(j, t_i)$ results from setting $j \in S_i$ to state $C$ and use of
“link” $\Delta\bar{C}_i(j, t_i)$ results from setting $j \in S_i$ to state $\bar{C}$. Thus,
given a path to the destination $d$, there is at least one network
state for which the use of $d$ as the destination tag results in
the routing of a message through that path.

The implication of Theorem 3.1 is that the use of a state
model for the IADM network reduces the problem of finding
alternate routing paths to that of controlling the states of the
switches in the network. Capitalizing on this idea, the following
theorems show how alternate routing paths can be found
in order to evade blockages in the network. A straight link
blockage occurs if a straight link on the routing path is faulty
or busy. A nonstraight link blockage is defined analogously.
The third type of blockage, called double nonstraight link
blockage, occurs if both nonstraight output links of a switch in
the routing path are faulty or busy. A switch blockage occurs
if the switch itself is busy or faulty. A switch blockage has the
same effect as blocking all of the switch's input links and can
be transformed into a link blockage problem accordingly. The
discussion on rerouting in this paper is concerned only with
link blockages.

Theorem 3.2: In the IADM network, a change of the state of
switch $j \in S_i$ results in a different routing path to a destination
$d$ if and only if a nonstraight output link of $j$ is used on the


original  routing  path  to  d.  Moreover,  the  other  nonstraight 
output  link  of  j  is  used  on  the  new  path. 

Proof: Changing the state of $j$ implies that the “link”
$\Delta\bar{C}_i(j, t_i)$ is used instead of $\Delta C_i(j, t_i)$ or vice versa.
However, if $\Delta C_i(j, t_i) = 0$ then $\Delta\bar{C}_i(j, t_i) = 0$ (i.e., both use a
straight link) and vice versa. □

With regard to the rerouting schemes proposed in this paper,
the implications of Theorem 3.2 are twofold. First, the “if”
part of the theorem implies that dynamic rerouting for a
nonstraight link blockage can be achieved by changing the
state of the switch whose output is the nonstraight link, which
is equivalent to rerouting the message through the oppositely
signed nonstraight link connected to the same switch. Thus, the
same subset of destinations is reachable from the two switches
whose input links are the two oppositely signed nonstraight
links. Second, the “only if” part of the theorem implies that
dynamic rerouting for a straight link blockage is impossible.
This is true in general since every routing path in the IADM
network can be the result of setting the network to some state.
Moreover, if a path from stage $i'$ to stage $i''$ consists of all
straight links connecting $j \in S_i$ and $j \in S_{i+1}$, $i' \le i < i''$,
then there exist no alternate routing paths from $j \in S_{i'}$ to
$j \in S_{i''}$, for otherwise there would exist an alternate routing
path branching from $j \in S_i$ and ending at the destination. The
only resort, if any at all, to bypass the straight link blockage is
to backtrack to a switch connected to a nonstraight link on the
routing path at some preceding stage and to reroute from that
switch. It remains to show that an alternate routing path always
exists, provided that such a nonstraight link exists. In fact,
the existence of an alternate routing path partly results from
Theorem 3.2, as stated in the next theorem. Fig. 5 illustrates
the situation in Theorem 3.3, for which a proof is provided
in [22].

Theorem 3.3: Consider a routing path in the IADM network
to a destination $d$ that contains a blocked straight link at stage
$i$. There exists at least one network state which results in
an alternate routing path that avoids the same straight link
blockage at stage $i$ if and only if the original routing path to $d$
contains a nonstraight link at stage $i - k$ for some $k$, $i \ge k > 0$.

Previous work [7], [9], [13] implies only the “if” part of
the theorem, i.e., the possibility of using a nonstraight link
of opposite sign in order to reroute a message in the case
of a nonstraight link failure. However, the “only if” part of
the theorem also implies that it is not possible
to devise a new rerouting scheme capable of avoiding a
backtracking (or look-ahead) mechanism in order to deal with
straight link blockages.

From Theorem 3.2 (for a given source/destination pair),
if the straight output link of a switch is on some routing
path, neither nonstraight output link of the switch can be
used for routing; if one of the nonstraight output links of a
switch is on some routing path, the other nonstraight link of
the switch is also on another routing path and the straight
link of the switch cannot be used for routing. So, for a given
switch, the output link blockages that affect paths from a given
source to a given destination can only be 1) a nonstraight link
blockage, 2) a straight link blockage, or 3) a double nonstraight

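Fig. 5. Rerouting for a straight link blockage in $(j \in S_i, j \in S_{i+1})$. Path
$((j + 2^{i-k}) \in S_{i-k}, j \in S_{i-k+1}, \ldots, j \in S_{i+1})$ is a segment of the original
path; $((j + 2^{i-k}) \in S_{i-k}, (j + 2^{i-k+1}) \in S_{i-k+1}, \ldots, (j + 2^i) \in S_i, j \in S_{i+1})$ and
$((j + 2^{i-k}) \in S_{i-k}, (j + 2^{i-k+1}) \in S_{i-k+1}, \ldots, (j + 2^i) \in S_i, (j + 2^{i+1}) \in S_{i+1})$
are the rerouting paths for it.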

link blockage.$^2$ Theorem 3.2 can be used to avoid case 1), a
nonstraight link blockage, and Theorem 3.3 to avoid case 2), a
straight link blockage. If case 3) occurs, then Theorem 3.2 cannot be
used to find a rerouting path. A backtracking scheme proposed
later in Corollary 4.2 based on Theorem 3.3 can be adapted
to overcome this type of blockage. The adapted backtracking
scheme is based on Theorem 3.4, which is illustrated in Fig. 6.
The proof of Theorem 3.4 is provided in [22].

Theorem 3.4: Consider a routing path in the IADM network
to a destination $d$ that contains a switch at stage $i$ both of whose
nonstraight output links are blocked. There exists at least one
network state which results in an alternate routing path that
avoids the same blocked nonstraight links at stage $i$ if and
only if the original routing path to $d$ contains a nonstraight
link at stage $i - k$ for some $k$, $i \ge k > 0$.
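
For small networks, Theorems 3.3 and 3.4 can be illustrated by brute force. The following Python sketch is not taken from the paper or from [22]; it assumes that a path is represented as a list of hypothetical `(switch, offset)` pairs, one per stage, and that avoiding a blocked straight link (or a pair of blocked nonstraight links) at switch $j \in S_i$ amounts to not passing through $j$ at stage $i$, because the tag bit $d_i$ fixes which kind of link $j$ must use.

```python
def all_paths(s: int, d: int, n: int) -> list:
    """Every routing path from s to d, as n hops (switch, link offset)."""
    size = 1 << n

    def extend(j: int, i: int) -> list:
        if i == n:
            return [[]]
        if ((j >> i) & 1) == ((d >> i) & 1):
            offsets = [0]                        # straight link is forced
        else:
            offsets = [1 << i, -(1 << i)]        # states C and C-bar differ
        return [[(j, off)] + tail
                for off in offsets
                for tail in extend((j + off) % size, i + 1)]

    return extend(s, 0)


def theorems_3_3_and_3_4_hold(n: int) -> bool:
    """Exhaustive check of the 'if and only if' conditions for N = 2**n."""
    size = 1 << n
    for s in range(size):
        for d in range(size):
            paths = all_paths(s, d, n)
            for path in paths:
                for i, (j, _) in enumerate(path):
                    # An alternate path must bypass switch j at stage i.
                    alternate_exists = any(q[i][0] != j for q in paths)
                    # Condition of the theorems: a nonstraight link at an
                    # earlier stage i - k, k > 0, of the original path.
                    earlier_nonstraight = any(off != 0 for _, off in path[:i])
                    if alternate_exists != earlier_nonstraight:
                        return False
    return True


if __name__ == "__main__":
    print(len(all_paths(1, 0, 3)))
    print(theorems_3_3_and_3_4_hold(3))
```

Paths are distinguished by the links they use, so the two links $\pm 2^{n-1}$ of the last stage count as different paths even though they reach the same output switch.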

IV.  State-Based  Routing  and  Rerouting  Schemes 

In  this  section,  routing  and  rerouting  schemes  are  discussed 
based  on  the  theory  developed  in  Section  III.  As  mentioned 

$^2$ Physically it is possible to have any combination of blockages of the
output links of a given switch. However, the possible routing paths for a given
source/destination pair can be affected by either a straight link blockage or a
double nonstraight link blockage in a given switch but never both types of
blockage.




Fig. 6. Rerouting for a double nonstraight link blockage in $(j \in S_i, (j - 2^i) \in S_{i+1})$
and $(j \in S_i, (j + 2^i) \in S_{i+1})$. Path $((j + 2^{i-k}) \in S_{i-k}, (j + 2^{i-k+1}) \in S_{i-k+1},
\ldots, (j + 2^i) \in S_i, (j + 2^i) \in S_{i+1})$ is a rerouting
path for both paths $((j + 2^{i-k}) \in S_{i-k}, j \in S_{i-k+1}, \ldots, j \in S_i, (j - 2^i) \in S_{i+1})$
and $((j + 2^{i-k}) \in S_{i-k}, j \in S_{i-k+1}, \ldots, j \in S_i, (j + 2^i) \in S_{i+1})$.

earlier, the novelty of the ideas in this paper lies in the state
model of the routing behavior of each switch. In previously
proposed approaches, routing is determined solely by tag bits.
According to the state model, the switching action of each
network element is conceptually determined by its relative
position (i.e., an even$_i$ or odd$_i$ switch), its state (i.e., $C$ or $\bar{C}$),
and a destination tag bit (i.e., 0 or 1) (Fig. 4). This conceptual
separation of routing information makes it possible to devise
the simple routing schemes described in this section.

In the first scheme, each switch is initially set up to behave
as an odd$_i$ or even$_i$ switch. In addition, each switch can
dynamically be set to one of the logical states $C$ or $\bar{C}$. In other
words, this scheme corresponds to a direct implementation of
the conceptual view of switch states. Destination tags are used
and, according to Theorem 3.1, the state of the network is
transparent to the sender of the message since it only affects
the path of the message and not its destination. Consequently,
rerouting is also transparent in the sense that it results from
a change in the network state. In practice, the implementation
can be such that, for instance, state $C$ (or $\bar{C}$) is used as
the default state for each switch in the IADM network and
the switch regards the other nonstraight link as a spare link
for rerouting; if a nonstraight blockage is detected, then the
switch changes state to $\bar{C}$ (or $C$) so that the spare link is used
instead. This scheme is called the Self-Repairing State-Based
Destination Tag (SSDT) scheme.

Rerouting  is  useful  not  only  when  one  nonstraight  link 
in  a  switch  is  faulty  or  busy,  but  also  if  both  nonstraight 
links  are  busy.  For  example,  when  considering  a  packet 
switching  environment,  rerouting  may  be  desirable  as  a  means 
of  balancing  the  message  load  throughout  the  network.  The 
scheme  proposed  here  is  well  suited  for  this  purpose.  Assume 
that  each  nonstraight  link  has  an  associated  buffer  (queue). 
When  both  nonstraight  links  are  busy  due  to  message  traffic 
congestion,  a  switch  can  choose  which  nonstraight  buffer  to 
assign  a  message  to  (i.e.,  which  state  to  associate  with  that 
queued  message),  based  on  the  number  of  messages  present 
in  the  buffers  in  order  to  evenly  distribute  the  message  load 
to  the  nonstraight  links. 
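
The load-balancing idea can be sketched in a few lines of Python. This is only one plausible policy, stated here as an assumption: the paper does not prescribe a tie-breaking rule, and the function name `choose_state` and its queue-length arguments are hypothetical.

```python
def choose_state(queue_len_c: int, queue_len_c_bar: int, default: str = "C") -> str:
    """Pick the switch state for a queued message from buffer occupancy.

    queue_len_c     -- messages waiting on the nonstraight link used in state C
    queue_len_c_bar -- messages waiting on the link used in state C-bar
    Ties fall back to the default state (an assumption of this sketch).
    """
    if queue_len_c < queue_len_c_bar:
        return "C"
    if queue_len_c_bar < queue_len_c:
        return "C_bar"
    return default


if __name__ == "__main__":
    print(choose_state(3, 1))
```

Because of Theorem 3.1, either choice still delivers the message to its destination, so the policy only has to consider congestion.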

The proposed SSDT scheme has the advantages that it uses
simple $n$-bit destination tags and is capable of rerouting
messages when blockages occur in nonstraight links. In addition,
rerouting of a message is transparent to its sender since the
path of the message is determined by the state of the network.
For a given destination tag, the routing behavior of each switch
on a possible path is determined by the state of the switch,
i.e., the SSDT scheme is fully distributed and rerouting is
done dynamically. Each switch requires a negligible amount
of extra hardware for the detection of blocked links and the
representation of two possible states.
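
The behavior of the SSDT scheme can be simulated as follows. The sketch assumes that state $C$ is the default state of every switch and that blocked links are given as a set of hypothetical `(stage, switch, offset)` triples; the name `ssdt_route` is introduced here for illustration. It returns `None` when the blockage is one that SSDT cannot evade (a straight link blockage or a double nonstraight link blockage).

```python
def ssdt_route(s: int, d: int, n: int, blocked=frozenset()):
    """Route from s to d with n-bit destination tag d under SSDT.

    blocked holds (stage, switch, offset) triples, offset in {0, +2**i, -2**i}.
    Returns the list of switch labels visited, or None if routing fails.
    """
    size = 1 << n
    j = s
    path = [s]
    for i in range(n):
        j_i, d_i = (j >> i) & 1, (d >> i) & 1
        if j_i == d_i:
            offset = 0
            if (i, j, 0) in blocked:
                return None               # straight link blockage: no local fix
        else:
            offset = (1 << i) if j_i == 0 else -(1 << i)   # default state C
            if (i, j, offset) in blocked:
                offset = -offset          # switch changes state to C-bar
                if (i, j, offset) in blocked:
                    return None           # double nonstraight link blockage
        j = (j + offset) % size
        path.append(j)
    return path


if __name__ == "__main__":
    print(ssdt_route(1, 0, 3))
    print(ssdt_route(1, 0, 3, {(0, 1, -1)}))
    print(ssdt_route(1, 0, 3, {(0, 1, -1), (1, 2, -2)}))
```

The sender supplies only $d$; the state change happens inside the switch that detects the blocked link.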

The second scheme is called the Two-bit State-Based
Destination Tag (TSDT) scheme and it uses $2n$-bit routing tags,
which specify both the destination of the message and the
states of switches on the corresponding path. The TSDT
scheme has the advantage that rerouting is possible when
blockages occur for straight as well as nonstraight links.

As with the first scheme, the TSDT scheme assumes that
each switch is appropriately initialized to behave as an odd$_i$
or even$_i$ switch. Each “digit” of the routing tag is represented
by two bits $b_{n+i}$ and $b_i$, called the state bit and the destination
bit, respectively. For this scheme, the state of a switch of stage
$i$ is specified by $b_{n+i}$: if $b_{n+i} = 0$, the switch is in state $C$, and
if $b_{n+i} = 1$, the switch is in state $\bar{C}$. For all $i$, $0 \le i \le n - 1$,
$b_i = d_i$. In general, if $j$ is an even$_i$ switch, $b_i b_{n+i} = 00$
and $b_i b_{n+i} = 01$ direct the message through a straight link,
$b_i b_{n+i} = 10$ through link $+2^i$, and $b_i b_{n+i} = 11$ through link
$-2^i$; if $j$ is an odd$_i$ switch, $b_i b_{n+i} = 10$ and $b_i b_{n+i} = 11$ direct
the message through a straight link, $b_i b_{n+i} = 01$ through link
$+2^i$, and $b_i b_{n+i} = 00$ through link $-2^i$. In general, given a
switch, the destination bit specifies use of a straight link or a
nonstraight link while the state bit determines the choice of the
positive or the negative link (if the chosen link is a nonstraight
link). Since state information is carried by the routing tag,
switches are not required to determine and remember their
own states, i.e., the design of the switches does not need to



implement the logic states $C$ and $\bar{C}$.

From Theorem 3.2, a nonstraight link blockage at stage $i$ can
be bypassed conveniently by complementing the $i$th state bit
while the destination bits remain unchanged. For convenience
of reference, this is restated in terms of the TSDT scheme as
Corollary 4.1 below.

Corollary 4.1: Let $b_{n/2n-1}$ and $b'_{n/2n-1}$ be the state bits
of the routing tag and the rerouting tag, respectively, for the
IADM network. In order to bypass a nonstraight link blockage
at stage $i$, state bit $b_{n+i}$ needs to be changed to $\bar{b}_{n+i}$. That is,

$$b'_{n/2n-1} = b_{n/n+i-1}\, \bar{b}_{n+i}\, b_{n+i+1/2n-1}. \qquad \square$$

Fig. 7 illustrates an example of routing from $s = 1$ to $d = 0$
in an IADM network of size $N = 8$. Let $b_{0/5} = 000000$
be the routing tag and $b'_{0/5}$ and $b''_{0/5}$ denote the rerouting
tags. The original tag $b_{0/5} = 000000$ specifies the path
$(1 \in S_0, 0 \in S_1, 0 \in S_2, 0 \in S_3)$. If $(1 \in S_0, 0 \in S_1)$
is blocked, the rerouting tag $b'_{0/5} = 000100$ is obtained
by complementing $b_3$, and link $(1 \in S_0, 2 \in S_1)$ is used for
rerouting. This tag specifies the path $(1 \in S_0, 2 \in S_1, 0 \in
S_2, 0 \in S_3)$. If $(2 \in S_1, 0 \in S_2)$ is also blocked, the rerouting
tag $b''_{0/5} = 000110$ results from complementing $b_4$, and link
$(2 \in S_1, 4 \in S_2)$ is used for rerouting. This tag specifies the
path $(1 \in S_0, 2 \in S_1, 4 \in S_2, 0 \in S_3)$.
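
The tag decoding rule and the example above can be reproduced with a short Python sketch. It is an illustration under assumptions and not the authors' implementation: tags are represented as strings `b_0 b_1 ... b_{2n-1}` read left to right, and the names `tsdt_route` and `complement_state_bit` are hypothetical.

```python
def tsdt_route(s: int, tag: str, n: int) -> list:
    """Follow a 2n-bit TSDT tag from switch s in S_0; return visited labels.

    tag[i] is the destination bit b_i and tag[n + i] is the state bit b_{n+i}.
    """
    if len(tag) != 2 * n:
        raise ValueError("a TSDT tag must have 2n bits")
    size = 1 << n
    j = s
    path = [s]
    for i in range(n):
        dest_bit, state_bit = int(tag[i]), int(tag[n + i])
        j_i = (j >> i) & 1
        if j_i == dest_bit:
            offset = 0                                    # straight link
        else:
            offset = (1 << i) if j_i == 0 else -(1 << i)  # state C
            if state_bit:
                offset = -offset                          # state C-bar
        j = (j + offset) % size
        path.append(j)
    return path


def complement_state_bit(tag: str, i: int, n: int) -> str:
    """Corollary 4.1: flip state bit b_{n+i}, keep every other bit."""
    k = n + i
    flipped = "1" if tag[k] == "0" else "0"
    return tag[:k] + flipped + tag[k + 1:]


if __name__ == "__main__":
    tag = "000000"
    for stage in (0, 1):
        print(tag, tsdt_route(1, tag, 3))
        tag = complement_state_bit(tag, stage, 3)
    print(tag, tsdt_route(1, tag, 3))
```

Run on the example, the three tags 000000, 000100, and 000110 yield the three paths listed in the text.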

As discussed in Section III, a straight link blockage and a
double nonstraight link blockage cannot be overcome easily;
a backtracking (or look-ahead) mechanism must be implemented
in order to evade these types of blockages. Since all links
in the routing path from stage $i - k + 1$ to stage $i$ consist of
only straight links, backtracking of at least $k$ stages is required
to find the switch from which an alternate routing path
branches. That is, at least $k$ state bits need to be considered
for change. Due to the similarity between Theorems 3.3 and
3.4, the TSDT schemes for finding the rerouting paths from
Theorems 3.3 and 3.4 are exactly the same, which is stated as
Corollary 4.2 (see [22] for the proof).

Corollary 4.2: Let $b_{n/2n-1}$ and $b'_{n/2n-1}$ be the state bits
of the routing tag and the rerouting tag, respectively, for a
source/destination pair in the IADM network. Let $i - k$ be the
largest stage number for $i \ge k > 0$ such that a switch at stage
$i - k$ is connected to a nonstraight link on the routing path. In
order to bypass a straight link blockage or a double nonstraight
link blockage at stage $i$, only state bits $b_{n+(i-k)/n+i-1}$ need
to be changed: a) $b'_{n/n+(i-1)} = b_{n/n+(i-k)-1}\, \bar{d}_{i-k/i-1}$ if the
nonstraight link at stage $i - k$ of the original path is link $-2^{i-k}$,
and b) $b'_{n/n+(i-1)} = b_{n/n+(i-k)-1}\, d_{i-k/i-1}$ if the nonstraight
link at stage $i - k$ of the original path is link $+2^{i-k}$. The state
bits $b'_{n+i/2n-1}$ have arbitrary values in both cases.

The example in Fig. 7 can be used to illustrate the TSDT
scheme for 1) a straight link blockage and 2) a double
nonstraight link blockage. 1) Again the tag $b_{0/5} = 000000$
specifies a path $(1 \in S_0, 0 \in S_1, 0 \in S_2, 0 \in S_3)$. If the
straight link $(0 \in S_1, 0 \in S_2)$ is blocked, the rerouting tag can
be 000110, which specifies path $(1 \in S_0, 2 \in S_1, 4 \in S_2, 0 \in
S_3)$, by having $b'_{3+0} b'_{3+1} b'_{3+2} = \bar{d}_0 \bar{d}_1 b_{3+2} = 110$. Since state
bits $b'_{3+1} b'_{3+2}$ can be arbitrary, 000100, for example, is also
a valid rerouting tag; it specifies path $(1 \in S_0, 2 \in S_1, 0 \in
S_2, 0 \in S_3)$. 2) Let the tag $b_{0/5} = 000110$ specify a path


Fig. 7. All routing paths from $1 \in S_0$ to $0 \in S_3$ in an IADM network of
size $N = 8$.

$(1 \in S_0, 2 \in S_1, 4 \in S_2, 0 \in S_3)$. If both nonstraight output
links of $4 \in S_2$ are blocked, the rerouting tag $b'_{0/5}$ can be
000100, which specifies path $(1 \in S_0, 2 \in S_1, 0 \in S_2, 0 \in S_3)$,
by having $b'_{3+0} b'_{3+1} b'_{3+2} = b_{3+0} d_1 d_2$. Since state bit $b'_{3+2}$
can be arbitrary, 000101 is also a valid rerouting tag which
also specifies the same path.
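
The state-bit rewriting of Corollary 4.2 can be sketched in Python as shown below. This is an illustrative reading of the corollary, not the authors' code: tags are strings `b_0 ... b_{2n-1}`, the helper names `hops` and `backtrack_tag` are hypothetical, and the "arbitrary" state bits from stage $i$ onward are simply left unchanged, which is one of the permitted choices.

```python
def hops(s: int, tag: str, n: int) -> list:
    """(switch, link offset) used at each stage when following a TSDT tag."""
    size = 1 << n
    j, out = s, []
    for i in range(n):
        j_i = (j >> i) & 1
        if j_i == int(tag[i]):
            offset = 0
        else:
            offset = (1 << i) if j_i == 0 else -(1 << i)
            if tag[n + i] == "1":
                offset = -offset
        out.append((j, offset))
        j = (j + offset) % size
    return out


def backtrack_tag(s: int, tag: str, i: int, n: int):
    """Rerouting tag for a straight or double nonstraight blockage at stage i.

    Returns None when the original path has no nonstraight link before
    stage i, in which case no rerouting path exists (Theorems 3.3 and 3.4).
    """
    path = hops(s, tag, n)
    earlier = [m for m in range(i) if path[m][1] != 0]
    if not earlier:
        return None
    start = earlier[-1]                      # stage i - k
    went_minus = path[start][1] < 0          # original link was -2**(i-k)
    bits = list(tag)
    for m in range(start, i):
        d_m = tag[m]
        complement = "1" if d_m == "0" else "0"
        # Case a) uses the complemented destination bits, case b) the bits as is.
        bits[n + m] = complement if went_minus else d_m
    return "".join(bits)


if __name__ == "__main__":
    print(backtrack_tag(1, "000000", 1, 3))   # straight link blockage at stage 1
    print(backtrack_tag(1, "000110", 2, 3))   # double nonstraight blockage at stage 2
```

For the two cases discussed in the text, the sketch produces the tag 000100, which is among the valid rerouting tags listed there.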

The rerouting path computed from Corollary 4.2 is
blockage-free from stage 0 to stage $i$. While the rerouting path is
different from the original routing path from stage $i - k$ to
stage $i$, the routing path from stage 0 to $i - k - 1$ remains
the same. This results from the fact that backtracking always
proceeds backward along the original path until it stops at
stage $i - k$, and the rerouting path only changes course from
stage $i - k$ onwards. Although state bits $b_{n+i+1/2n-1}$ remain
unchanged, the routing path from stage $i$ to $n - 1$ may still be
altered due to the changes from stage $i - k$ to $i$. For example,
in Fig. 5, the switch on the original routing path at stage $i + 1$
is $j \in S_{i+1}$ whereas the switch on the rerouting path at stage
$i + 1$ may be $(j + 2^{i+1}) \in S_{i+1}$, which may further induce
changes at higher-order stages.

In the TSDT scheme, the tag can be computed by the
message sender, which is assumed to know the location of
faulty links and switches in the network. Thus, rerouting is
transparent to the switches in the sense that the tag computed
by the sender of the message simply avoids the usage of faulty
links and switches. Therefore, switches do not require any
extra hardware for rerouting purposes. It is assumed that there
exists a fault detection and location mechanism and a process
that maintains a list of faulty switches and links. Changes to
the list are broadcast by this process to the senders (the exact
broadcast method depends on system implementation). For
better performance, these lists should be in fast access memory
(e.g., cache). An alternative is to implement dynamic rerouting
for the TSDT scheme. Since backtracking is indispensable for
avoiding a straight link blockage, it is required that each switch
can detect the inaccessibility of any output port (connected to


RAU  et  aL :  DESTINATION  TAG  ROUTING  TECHNIQUES  3ASED  ON  STATE  MOOEL  FOR  IADM  NETWORK 


281 


a  switch  at  the  next  stage)  and  signal  the  presence  of  the 
d lockage  back  to  the  switches  of  previous  stages  [10],  [12]. 
Whether  rerouting  is  done  by  the  sender  or  dynamically  is  an 
implementation  decision  which  depends  on  how  many  stages 
of  backtracking  are  allowed.  When  the  sender  computes  the 
tag,  it  must  be  able  to  identify  and  track  the  switches  and  links 
>n  the  corresponding  routing  and  rerouting  paths  (the  next 
paragraphs  explain  how  this  is  done).  If  any  of  the  switcnes  or 
links  in  the  path  is  known  to  the  sender  as  being  faulty,  then 
the  sender  computes  another  tag  by  changing  the  state  bits  as 
described  in  Section  V. 

Locating the switches on the routing path is straightforward.
For a given source $s$ and a destination $d$, the initial routing
path can be specified by setting state bits
$b_{n/2n-1} = 0_{n/2n-1}$ (a string of $n$ 0's), equivalent to
setting every switch in the IADM network to state $C$. Then every
switch on the original path has label
$d_{0/i-1}s_{i/n-1} \in S_i$, $0 \le i \le n - 1$, since now the
IADM network functions like an ICube network [6], [15].
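The label rule for the all-$C$ (ICube) path can be written as a bit-masking operation. This is an illustrative sketch only: the function name `icube_labels` is hypothetical, bit 0 is taken to be the least significant bit, and the list is extended to stage $n$ (the destination) for convenience.

```python
def icube_labels(source, dest, n):
    """Labels of the switches on the all-state-C path, stages 0..n.

    At stage i the low i bits come from the destination and the
    remaining high bits come from the source.
    """
    full_mask = (1 << n) - 1
    labels = []
    for i in range(n + 1):
        low_mask = (1 << i) - 1
        high_mask = full_mask & ~low_mask
        labels.append((dest & low_mask) | (source & high_mask))
    return labels


print(icube_labels(1, 0, 3))
print(icube_labels(5, 3, 3))
```

Because each label is computed directly from $s$ and $d$, the sender can test whether a known faulty switch lies on the original path without simulating the network.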

To find the switches on the rerouting path, let $j \in S_i$ be
the switch whose output link is blocked. First consider the case
where the blocked link is a nonstraight link. It may be 1) a
positive or 2) a negative link. In case 1) the switch at stage
$i + 1$ reached by the positive link is $(j + 2^i) \in S_{i+1}$
and, from Corollary 4.1, rerouting can be done through switch
$(j - 2^i) \in S_{i+1}$. In case 2) the switch at stage $i + 1$
reached by the negative link is $(j - 2^i) \in S_{i+1}$ and, from
Corollary 4.1, rerouting can be done through switch
$(j + 2^i) \in S_{i+1}$. Let the switch at stage $i + 1$ on the
rerouting path be $w_{0/n-1}$. The state bits $b_{n+(i+1)/2n-1}$
remain intact (equal to 0's), which corresponds to having every
switch from stage $i + 1$ to $n - 1$ remain in state $C$ so that
the IADM network from stage $i + 1$ to $n - 1$ can emulate the
ICube network from stage $i + 1$ to $n - 1$. Thus, bits $l$ to
$n - 1$, $i + 1 \le l \le n - 1$, of the label of a switch on the
rerouting path are $w_{l/n-1}$. From Lemma 2.1, bits 0 to $l - 1$,
$1 \le l \le i + 1$, of the label of a switch on a path to
destination $d_{0/n-1}$ must be $d_{0/l-1}$. Hence, the switch on
the rerouting path from stage $i + 1$ to $n - 1$ has label
$d_{0/l-1}w_{l/n-1}$, $i + 1 \le l \le n - 1$.

Next consider the case where the blockage of $j \in S_i$ is a
straight link blockage or a double nonstraight link blockage, so
that backtracking is necessary. There are two subcases for each
type of blockage: a) the nonstraight link found in backtracking is
a negative link and b) it is a positive link. Here only subcase a)
of the straight link blockage is considered; the other cases can
be dealt with similarly. From the proof of Corollary 4.2 (case a)
only), the switch on the rerouting path is $(j + 2^l) \in S_l$,
$i - k < l \le i$. The switch of stage $i + 1$ on the rerouting
path is $j \in S_{i+1}$ if $b'_{n+i} = 0$ and $j \in S_{i+1}$ is
an odd$_i$ switch or if $b'_{n+i} = 1$ and $j \in S_{i+1}$ is an
even$_i$ switch, and is $(j + 2^{i+1}) \in S_{i+1}$ if
$b'_{n+i} = 0$ and $j \in S_{i+1}$ is an even$_i$ switch or if
$b'_{n+i} = 1$ and $j \in S_{i+1}$ is an odd$_i$ switch. The
identification of switches on the rerouting path from stage
$i + 1$ to $n - 1$ is done as in the case of a nonstraight link
blockage described above.

The blocked link can be represented by the two switches joined
by the link. Since every switch on the original routing path and
the rerouting paths can be easily identified as described above,
it can be readily determined whether or not the blocked link is
on the current path.

In summary, for both SDT schemes, the binary representation of
the destination address can be used directly as the routing tag.
In the SSDT scheme, rerouting tags are not needed, and in the TSDT
scheme, rerouting tags result from simple bit complementing
operations. In terms of the complexity of computing a rerouting
tag, the SSDT scheme and the TSDT scheme for one instance of
nonstraight link blockage require time × space complexity $O(1)$,
an improvement over previously proposed schemes [9] dealing with
rerouting for a nonstraight link blockage, which require time ×
space complexity $O(\log N)$. In [10] a single-stage look-ahead
scheme for rerouting of a straight link blockage was proposed; it
requires use of two's complement to compute the positive and
negative dominant tags, so that the scheme has time × space
complexity of $O(\log N)$. Note that the single-stage look-ahead
rerouting scheme is valid only for some cases of the straight link
blockage; it cannot be applied to every case of the straight link
blockage. From Corollary 4.2, $k$-stage backtracking is needed for
a straight link blockage and $k$ bits of the state bits need to be
changed; thus, the complexity of the TSDT scheme for a straight
link blockage is $O(k)$. If only single-stage backtracking
(corresponding to single-stage look-ahead) is necessary, rerouting
can be done dynamically and the complexity is $O(1)$, an
improvement over the scheme in [10].
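To illustrate why rerouting around a single nonstraight link blockage costs constant work, the sketch below complements one state bit. It assumes that Corollary 4.1 (stated earlier in the paper) amounts to complementing state bit $b_{n+i}$ for a blockage at stage $i$; the function name and the string encoding of the tag are hypothetical.

```python
def reroute_nonstraight(tag, i, n):
    """Complement state bit b_{n+i} so stage i uses the other nonstraight link.

    tag is a 2n-character string of '0'/'1' (destination bits, then
    state bits). Only one bit changes; copying the string is an
    artifact of this sketch, not part of the routing cost.
    """
    bits = list(tag)
    bits[n + i] = "1" if bits[n + i] == "0" else "0"
    return "".join(bits)


print(reroute_nonstraight("000110", 1, 3))
```

In hardware the same effect is a single bit flip in the tag, which is the source of the $O(1)$ figure quoted above.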

V.  A  Universal  Rerouting  Algorithm 
for  Multiple  Blockages 

The TSDT scheme can be applied not only to one instance of a
blockage but also repeatedly, each time a new blockage is
encountered as the message propagates. This section derives an
algorithm to deal with any case of multiple blockages. The
backtracking schemes proposed in Corollary 4.2 find a rerouting
path for a straight link blockage and a double nonstraight link
blockage. Nevertheless, it is possible that blockages also exist
on the rerouting path; then further backtracking to a lower-order
stage is needed. Since this phenomenon can recur, repeated
backtracking may be necessary due to blockages on the rerouting
paths. The algorithm BACKTRACK described next performs iterated
backtracking to find an alternate routing path. It underlies a
universal rerouting algorithm (called REROUTE), shown later, that
can find a routing path, if one exists, to bypass multiple
blockages in the network.

The inputs to algorithm BACKTRACK are the current routing path
$P$, the stage number $i$ where a blockage occurs, and state bits
$b'_{n/2n-1}$ representing path $P$. The algorithm returns updated
values of the state bits $b'_{n/2n-1}$ which specify a rerouting
path that is blockage-free from stage 0 to stage $i$ if such a
rerouting path exists, or returns FAIL if the blockages on the
current routing path and the rerouting paths eliminate the
possibility of communication between the source and the
destination. It is assumed that the blockage on the original
routing path at stage $i$ is a straight link blockage or a double
nonstraight link blockage and that $j \in S_i$ is the switch whose
output links are the blocked links. Informal explanations of the
algorithm are given following the algorithm, and its correctness
proof can be found in [22].

IEEE TRANSACTIONS ON COMPUTERS, VOL. 41, NO. 3, MARCH 1992

Algorithm BACKTRACK (and REROUTE) presumes knowledge of all
blockages in the network. The network controller is responsible
for collecting this information and maintaining a global map of
blockages, which is accessible to every sender of messages in
order to compute a path that avoids the blockages. In addition,
since it may take several iterations before a blockage-free path
can be found or it can be concluded that no blockage-free path
exists, the sender of the message needs to maintain and update the
locations of switches on the rerouting path in each iteration.

Algorithm BACKTRACK ($P$, $i$, $b'_{n/2n-1}$)

0: Let $q$ be the stage number where a blockage occurs.
   $q \leftarrow i$.

1: $P$ = the current routing path.
   Backtrack on path $P$ from stage $q$ to find a nonstraight
   link. If no nonstraight link exists at any preceding stage,
   return(FAIL); otherwise assign to $r$ the stage number where
   the first nonstraight output link is found.

2: If the nonstraight link at stage $r$ on the routing path is
   $+2^r$, assign flag linkfound value 0; if it is $-2^r$, assign
   linkfound value 1.

3: If linkfound = 0,
   $b'_{n/2n-1} \leftarrow b'_{n/n+r-1}\, d_{r/q-1}\, b'_{n+q/2n-1}$;
   if linkfound = 1,
   $b'_{n/2n-1} \leftarrow b'_{n/n+r-1}\, \bar{d}_{r/q-1}\, b'_{n+q/2n-1}$.

4a: This step applies only when the blockage at stage $q$ on
   path $P$ is a straight link blockage.
   If linkfound = 0, set $b'_{n+q} = d_q$; if
   $((j - 2^q) \in S_q, (j - 2^{q+1}) \in S_{q+1})$ is blocked,
   change $b'_{n+q}$ to $\bar{d}_q$; furthermore, if
   $((j - 2^q) \in S_q, j \in S_{q+1})$ is also blocked,
   return(FAIL). If linkfound = 1, set $b'_{n+q} = \bar{d}_q$; if
   $((j + 2^q) \in S_q, (j + 2^{q+1}) \in S_{q+1})$ is blocked,
   change $b'_{n+q}$ to $d_q$; furthermore, if
   $((j + 2^q) \in S_q, j \in S_{q+1})$ is also blocked,
   return(FAIL).

4b: This step applies only when the blockage at stage $q$ on
   path $P$ is a double nonstraight link blockage.
   If $((j - 2^q) \in S_q, (j - 2^q) \in S_{q+1})$ is blocked for
   linkfound = 0, or $((j + 2^q) \in S_q, (j + 2^q) \in S_{q+1})$
   is blocked for linkfound = 1, return(FAIL).

5: Let $Q$ denote the part of the rerouting path (specified by
   the tag in step 3) from stage $r + 1$ to $q$.
   If linkfound = 0,
   $Q = ((j - 2^{r+1}) \in S_{r+1}, \ldots, (j - 2^{q-1}) \in S_{q-1}, (j - 2^q) \in S_q)$;
   if linkfound = 1,
   $Q = ((j + 2^{r+1}) \in S_{r+1}, \ldots, (j + 2^{q-1}) \in S_{q-1}, (j + 2^q) \in S_q)$.
   If a blockage occurs on path $Q$, return(FAIL).

6: If linkfound = 0 and
   $((j - 2^r) \in S_r, (j - 2^{r+1}) \in S_{r+1})$ is blocked, or
   if linkfound = 1 and
   $((j + 2^r) \in S_r, (j + 2^{r+1}) \in S_{r+1})$ is blocked,
   go to step 7; else return($b'_{n/2n-1}$).

7: $j \leftarrow j \mp 2^r$ (minus if linkfound = 0, plus if
   linkfound = 1), $q \leftarrow r$.

8: Backtrack on path $P$ from stage $q$ to find a nonstraight
   link. If no nonstraight link exists at any preceding stage,
   return(FAIL); otherwise assign to $r$ the stage number where
   the first nonstraight output link is found.

9: If linkfound = 0 and the nonstraight link at stage $r$ is
   $-2^r$, or if linkfound = 1 and the nonstraight link at stage
   $r$ is $+2^r$, return(FAIL).

10: If linkfound = 0,
   $b'_{n/2n-1} \leftarrow b'_{n/n+r-1}\, d_{r/q-1}\, b'_{n+q/2n-1}$;
   if linkfound = 1,
   $b'_{n/2n-1} \leftarrow b'_{n/n+r-1}\, \bar{d}_{r/q-1}\, b'_{n+q/2n-1}$.
   Go to step 4b.

Step 0 is the initialization step. From Theorems 3.3 and 3.4,
an alternate path exists for avoiding a straight link blockage or
a double nonstraight link blockage if and only if there exists a
nonstraight link at some stage preceding stage $i$; step 1 of the
algorithm searches backward for such a nonstraight link. If none
is found, the algorithm terminates prematurely, reflecting the
fact that no alternate paths for rerouting exist. Step 2 is used
to differentiate the cases when the nonstraight link at stage $r$
found in the first backtracking is a positive link and when it is
a negative link; flag linkfound is assigned 0 for the former and 1
for the latter. If a nonstraight link exists at some stage
preceding the blockages, in step 3 Corollary 4.2 is applied to
find the state bits specifying the rerouting path: cases a) and b)
in Corollary 4.2 correspond to linkfound = 1 and linkfound = 0,
respectively, and $q$ and $r$ correspond to $i$ and $i - k$,
respectively.
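Steps 1 to 3 of BACKTRACK can be sketched for a single blockage as follows. This is a simplified illustration, not the full algorithm: steps 4a to 10 are omitted, the function name `backtrack_once` and the string tag encoding are hypothetical, and the code assumes that step 3 copies the destination bits $d_{r/q-1}$ into the state bits for linkfound = 0 and their complements for linkfound = 1. That assumption reproduces the rerouting tag 000100 from the earlier example with source 1 and tag 000110.

```python
def backtrack_once(source, tag, q, n):
    """Steps 1-3 of BACKTRACK for one blockage at stage q.

    Returns (new_tag, r, linkfound), or None when no nonstraight
    link precedes stage q on the current path (the FAIL case).
    """
    size = 1 << n
    bits = [int(c) for c in tag]
    # Recompute the switches on the current path from the tag.
    path = [source]
    for i in range(n):
        j = path[-1]
        j_i = (j >> i) & 1
        if j_i == bits[i]:
            step = 0                     # straight link
        elif (j_i == 0) == (bits[n + i] == 0):
            step = 1 << i                # +2^i
        else:
            step = -(1 << i)             # -2^i
        path.append((j + step) % size)
    # Step 1: nearest preceding stage whose output link is nonstraight.
    r = next((s for s in range(q - 1, -1, -1) if path[s + 1] != path[s]), None)
    if r is None:
        return None
    # Step 2: linkfound = 0 for a +2^r link, 1 for a -2^r link.
    linkfound = 0 if path[r + 1] == (path[r] + (1 << r)) % size else 1
    # Step 3: rewrite state bits n+r .. n+q-1 (assumed bar placement).
    for l in range(r, q):
        bits[n + l] = bits[l] if linkfound == 0 else 1 - bits[l]
    return "".join(map(str, bits)), r, linkfound


print(backtrack_once(1, "000110", 2, 3))
```

The returned pair $(r, \text{linkfound})$ is what the later steps need in order to check the links of the rerouting path against the blockage map.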

Steps 4a and 4b deal with the link blockage at stage $q$ on the
rerouting path computed in step 3. If the blockage of a switch at
stage $q$ on path $P$ is a straight link blockage, the possible
rerouting links at stage $q$ are the two nonstraight links. In
step 4a the default link is a negative link if linkfound = 0 and a
positive link if linkfound = 1. If the default link is blocked,
step 4a attempts to reroute the message through the other
nonstraight link. If both nonstraight links are blocked, there
exist no blockage-free paths. Step 4b applies if the blockage of a
switch at stage $q$ on path $P$ is a double nonstraight link
blockage. The rerouting path must use a straight link at stage
$q$. If it is also blocked, no blockage-free path exists.

Step 5 checks blockages from stage $r + 1$ to stage $q - 1$ on
the rerouting path; if any blockage falls on $Q$, there exists no
blockage-free path. In step 6, if the blockage falls on the link
of stage $r$ on the rerouting path, further backtracking is
necessary. Otherwise (no blockages on the rerouting path), the
algorithm terminates with the state bits specifying the rerouting
path. Step 7 updates the stage number $q$ and the switch label $j$
where a blockage on the rerouting path occurs, initiating a new
iteration of backtracking. Step 8 is the same as step 1, searching
backward at lower-order stages again for a nonstraight link. Step
9 of the algorithm dictates that if the nonstraight link
encountered in the first iteration of backtracking is a positive
(or negative) link, the nonstraight link found in each subsequent
iteration of backtracking must also be a positive (or negative)
link; otherwise no blockage-free paths exist. If the condition in
step 9 is satisfied, step 10, which is the same as step 3,
computes a rerouting path. After the rerouting path is found, the
algorithm returns to step 4b to check for further blockages on the
rerouting path.

For each source/destination pair, a link on some routing path
for the source/destination pair is called a participating link. As
a direct result of Theorem 3.2, the set of participating output
links of a switch is composed of either its straight output link
or both of its nonstraight output links, but never all of them. So
the output link blockages of a switch, for a given
source/destination pair, can only be a straight link blockage, a
nonstraight link blockage, or a double nonstraight link blockage.
Algorithm BACKTRACK deals with the first and third kinds of
blockage, and the second kind of blockage can be overcome by
applying Corollary 4.1. Algorithm BACKTRACK and Corollary 4.1 can
be used to form a universal algorithm capable of rerouting
messages when multiple blockages exist in the IADM network. This
algorithm, called REROUTE, returns state bits specifying a
blockage-free rerouting path if one exists, or returns FAIL
otherwise.
Algorithm REROUTE ($P$, $b'_{n/2n-1}$)

0: $P$ = the original routing path.
   $b_{n/2n-1}$ = the routing tag specifying the original routing
   path.
   $b'_{n/2n-1}$ = the rerouting tag specifying the rerouting
   path.
   $b'_{n/2n-1} \leftarrow b_{n/2n-1}$.

1: Let $i$ be the smallest stage number such that there exists
   a blockage at stage $i$ on path $P$. If no blockages occur on
   path $P$, return($b'_{n/2n-1}$).

2: If the blockage at stage $i$ on path $P$ is a nonstraight
   link blockage and the other nonstraight link is not blocked,
   apply Corollary 4.1 to find state bits $b'_{n/2n-1}$ and go to
   step 4.

3: $b'_{n/2n-1} \leftarrow$ BACKTRACK($P$, $i$, $b'_{n/2n-1}$).

4: $Q$ = the rerouting path specified by state bits
   $b'_{n/2n-1}$. $P \leftarrow Q$ and go to step 1.

Step 0 is the initialization step. At the end of each iteration,
a blockage-free path from stage 0 to stage $i$ is found. Then a
new iteration starts and $i$ is given a new value in order to find
a path avoiding the blockages at a higher-order stage. The only
terminating conditions for algorithm REROUTE are the return of
FAIL from step 3, indicating that no blockage-free paths exist,
and the return from step 1, indicating that a blockage-free path
is found. Algorithm REROUTE is executed iteratively to evade
blockages from lower-order to higher-order stages. The correctness
of this algorithm follows from the correctness of algorithm
BACKTRACK and Corollary 4.1.
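For small networks, the outcome of REROUTE can be cross-checked against an exhaustive search over all $2^n$ settings of the state bits. The sketch below is this brute-force oracle, not the REROUTE algorithm itself. The function name and the representation of a blocked link as a (stage, from-switch, to-switch) triple are assumptions; the triple follows the remark that a blocked link can be represented by the two switches it joins, so the two coinciding links at stage $n - 1$ are not distinguished.

```python
from itertools import product


def find_blockage_free_tag(source, dest, n, blocked):
    """Return some tag whose path avoids every blocked link, or None.

    blocked is a set of (stage, from_switch, to_switch) triples.
    Exhaustive over state bits, so the cost is exponential in n.
    """
    size = 1 << n
    d_bits = [(dest >> i) & 1 for i in range(n)]
    for state in product((0, 1), repeat=n):
        j, ok = source, True
        for i in range(n):
            j_i = (j >> i) & 1
            if j_i == d_bits[i]:
                step = 0                  # straight link
            elif (j_i == 0) == (state[i] == 0):
                step = 1 << i             # +2^i
            else:
                step = -(1 << i)          # -2^i
            nxt = (j + step) % size
            if (i, j, nxt) in blocked:
                ok = False
                break
            j = nxt
        if ok:
            return "".join(map(str, d_bits + list(state)))
    return None


print(find_blockage_free_tag(1, 0, 3, {(0, 1, 0), (1, 2, 0)}))
```

An oracle of this kind is useful only for validation: when it returns None, a correct REROUTE must return FAIL, and when it returns a tag, REROUTE must return some blockage-free tag (not necessarily the same one).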

VI. Permutation Routing and Cube Subgraphs
of the IADM Network

The results discussed so far are a consequence of the existence
of spare nonstraight links in addition to the ICube network
embedded in the IADM network. This section pursues this issue
further by showing that there exist multiple distinct subgraphs in
the IADM network, each called a cube subgraph, that are isomorphic
to the ICube network. Two cube subgraphs are considered to be
distinct if they differ in at least one link. As mentioned in the
Introduction of this paper, the cube-type networks have been
studied extensively in the literature and shown to be
topologically equivalent. Together with results from these
studies, the knowledge of how to identify cube subgraphs can help
the understanding of the capabilities of the IADM network and be
useful for permutation routing in the IADM network.

Since each switch can be in state $C$ or $\bar{C}$, there are as
many as $2^{Nn}$ ($= N^N$) network states, although each does not
necessarily generate a unique permutation. Setting a switch to a
certain state indicates that one of its nonstraight output links
can be used for routing (i.e., it is active) while the other
cannot. Thus, each network state can be associated with a subgraph
of the IADM network which contains only the active links. When all
switches in the IADM network are set to state $C$, the IADM
network functions as an ICube network; this network state
corresponds to a cube subgraph. The constructive derivation of a
lower bound for the number of cube subgraphs of the IADM network
uses the two basic ideas discussed in the next paragraphs.

Since $+2^{n-1} = -2^{n-1} \bmod N$, $\ldots, t_{n-1}) = \ldots, t_{n-1})$,
i.e., the state of each switch of stage $n - 1$ is irrelevant in
the sense that any switch at stage $n - 1$ is always connected to
the same two switches at stage $n$. Consequently, given any cube
subgraph, there exist $(2^N - 1)$ subgraphs isomorphic to it which
differ only in their choices of the nonstraight link $+2^{n-1}$ or
$-2^{n-1}$ at stage $n - 1$. Therefore, the total number of
distinct cube subgraphs is given by the product of $2^N$ and the
number of distinct subgraphs of the IADM network from stage 0 to
stage $n - 2$ that are isomorphic to the same stages in the ICube
network.

The calculation of the number of subgraphs in the first $n - 1$
stages uses an idea similar to that proposed in [5] for
reconfiguring the DR network so that it performs as a Generalized
Cube network. All switches of the IADM network are logically
relabeled by adding a constant $x$, $0 \le x \le N - 1$, to the
original labels, i.e., switch $j$ becomes $j' = j + x$. By setting
each switch to be an even$_i$ or odd$_i$ switch according to its
new label and having all switches be in state $C$, a cube subgraph
results for each relabeling. However, of the $N$ possible
subgraphs, only $N/2$ are distinct as far as the first $n - 1$
stages are concerned. This result is stated in Theorem 6.1 and the
proof appears in [22]. A graphical interpretation of cube subgraph
isomorphism for an IADM network of size $N = 8$ is illustrated in
Fig. 8. In Fig. 8, each physical switch $j$ acts as a logical
switch $j' = (j + 1) \bmod 8$. The isomorphism to the ICube network
can be easily visualized by moving switch 7 to the top of each
stage as shown in the figure. Notice that setting some switch to
state $C$ according to its logical label may be equivalent to
setting the switch to state $\bar{C}$ according to its original
label. For instance, switch $0 \in S_0$ (logical label 1) is set to
state $\bar{C}$ in Fig. 8.

Theorem 6.1: There exist at least $N/2 \cdot 2^N$ distinct cube
subgraphs in the IADM network.
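The counting argument behind Theorem 6.1 can be checked numerically for small $N$. The sketch below is an illustration under stated assumptions, not the proof from [22]: the function names are hypothetical, a subgraph of the first $n - 1$ stages is identified by the active nonstraight link of every switch, and state $C$ by logical label is taken to mean that an even$_i$ switch uses $+2^i$ and an odd$_i$ switch uses $-2^i$.

```python
def cube_subgraph_links(x, n):
    """Active nonstraight link of every switch in stages 0..n-2
    when each physical switch j takes the logical label (j + x) mod N."""
    size = 1 << n
    links = []
    for i in range(n - 1):
        for j in range(size):
            logical = (j + x) % size
            is_even_i = ((logical >> i) & 1) == 0
            # State C by logical label: even_i uses +2^i, odd_i uses -2^i.
            links.append((1 << i) if is_even_i else -(1 << i))
    return tuple(links)


def count_bound(n):
    """Return (distinct relabelings in stages 0..n-2, resulting lower bound)."""
    size = 1 << n
    distinct = {cube_subgraph_links(x, n) for x in range(size)}
    return len(distinct), len(distinct) * 2 ** size


print(count_bound(3))
```

For $N = 8$ the $N$ relabelings collapse to $N/2 = 4$ distinct link sets, because adding $N/2$ to $x$ changes only bit $n - 1$ of each logical label; multiplying by the $2^N = 256$ choices at stage $n - 1$ gives 1024.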

In order to reconfigure the IADM network to one of its cube
subgraphs, each switch of stage $i$, for $0 \le i \le n - 2$, needs
to know the $i$th bit of its logical label. This can be done by
sending the same logical label to every switch in the same row at
system reconfiguration time. Each switch is set as being an
odd$_i$ or even$_i$ switch by examining the $i$th bit of the
logical label. All switches operate in state $C$ according to
their logical labels, with the exception of those at stage
$n - 1$, for which different states correspond to different
subgraphs.




Fig. 8. A cube subgraph generated by relabeling each switch $j$ as
$(j + 1) \bmod 8$ for an IADM network of size $N = 8$.


The results of this section can be used in different ways. One
use is in characterizing a class of permutations performable by
the IADM network. Permutations passable by the ICube network are
discussed in [15] and adaptable from [6]. Thus, the IADM network
can perform all of these permutations plus the same set of
permutations with a given $x$ added to both the source and
destination labels, $0 \le x < N/2$. Another use of the results of
this section is that the IADM network can pass the permutations
performable by the ICube network when the ICube network embedded
in the IADM network experiences nonstraight link failures. This is
done by incorporating a reconfiguration function in the system
that reassigns each switch $j$ to $(j + x)$ and reconfiguring the
IADM network to a corresponding cube subgraph which does not
include the faulty nonstraight links. In [21] it is shown that any
of the cube-type networks can pass the permutations performable by
the others by incorporating appropriate reconfiguration functions.
By the same token, the IADM network with a nonstraight link fault
can also pass the permutations performable by the cube-type
networks by including these reconfiguration functions in the
system.


VII.  Concluding  Remarks 

One of the main contributions of this paper is the identification
of destination tag routing schemes for the IADM network. They are
simpler and more efficient than previously known approaches, thus
requiring less complex switches and reducing message communication
delays due to routing overhead. In the SSDT scheme rerouting can
be done when nonstraight links fail, and in the TSDT scheme both
the straight and double nonstraight link blockages can be avoided.
As for the SSDT scheme, routing and rerouting are transparent to
the source and only negligible hardware and time are used by each
switch for routing and rerouting purposes. These are considerable
advantages over previously proposed schemes, which do not use
destination tags and require extra hardware or delays of
$O(\log N)$ complexity instead of $O(1)$. In addition, previous
works all deal only with certain types of blockages. Based on the
TSDT scheme, a universal rerouting algorithm is derived, which is
capable of avoiding any combination of multiple blockages if there
exists a blockage-free path and of indicating the absence of such
a path if there exists none. The rerouting capabilities of the new
schemes can be readily used for fault-tolerance and load balancing
purposes since they adequately exploit the redundancy available in
the IADM network.

Another contribution of this paper is the constructive derivation
of a lower bound on the number of cube subgraphs of the IADM
network. While it was previously known that the ICube network is a
subgraph of the IADM network, this paper shows that there exist at
least $N/2 \cdot 2^N$ distinct cube subgraphs. This, combined with
previous multistage cube network studies, can help characterize
some of the permutations performable by the IADM network. As
another use of the subgraph analysis, it is shown how to
reconfigure the IADM network under nonstraight link faults to pass
the cube-admissible permutations.

Perhaps the most fundamental contribution of this paper is that
of the network state model used for the IADM and the ICube
networks. The essence of this model is in the recognition that the
routing action of each switch is conceptually dependent on its
position in the network (topological information), its state
(functional information), and the destination of the message
(routing information). Topological information is fixed and, when
using destination tags, the same can be said of routing
information for a given message destination. Consequently, the
routing path is solely determined by the state of the network.
These basic concepts are applicable to networks other than those
considered in this paper; the state model can help devise new
designs, solve routing problems, and understand relationships
among networks.

References 

[1] D. Agrawal, "Graph theoretical analysis and design of multistage
interconnection networks," IEEE Trans. Comput., vol. C-32,
pp. 637-648, July 1983.

[2] K. E. Batcher, "The Flip network in STARAN," in Proc. 1976 Int.
Conf. Parallel Processing, Aug. 1976, pp. 65-71.

[3] T.-Y. Feng, "Data manipulating functions in parallel processors
and their implementations," IEEE Trans. Comput., vol. C-23,
pp. 309-318, Mar. 1974.

[4] L. R. Goke and G. J. Lipovski, "Banyan networks for partitioning
multiprocessor systems," in Proc. 1st Annu. Symp. Comput.
Architecture, Dec. 1973, pp. 21-28.

[5] M. Jeng and H. J. Siegel, "Design and analysis of dynamic
redundancy networks," IEEE Trans. Comput., vol. 37, pp. 1019-1029,
Sept. 1988.

[6] D. H. Lawrie, "Access and alignment of data in an array
processor," IEEE Trans. Comput., vol. C-24, pp. 1145-1155,
Dec. 1975.

[7] D. Lee and K. Y. Lee, "Control algorithms for the augmented data
manipulator network," in Proc. 1986 Int. Conf. Parallel Processing,
Aug. 1986, pp. 123-130.

[8] M. Malek and W. W. Myre, "A description method of
interconnection networks," IEEE Tech. Committee Distrib. Process.
Quart., vol. 1, pp. 1-6, Feb. 1981.

[9] R. J. McMillen and H. J. Siegel, "Routing schemes for the
augmented data manipulator network in an MIMD system," IEEE Trans.
Comput., vol. C-31, pp. 1202-1214, Dec. 1982.

[10] R. J. McMillen and H. J. Siegel, "Performance and fault
tolerance improvements in the inverse augmented data manipulator
network," in Proc. 9th Annu. Symp. Comput. Architecture, Apr. 1982,
pp. 63-72.

[11] R. J. McMillen and H. J. Siegel, "Evaluation of cube and data
manipulator networks," J. Parallel Distributed Comput., vol. 2,
no. 1, pp. 79-107, Feb. 1985.

[12] K. Padmanabhan and D. H. Lawrie, "A class of redundant path
multistage interconnection networks," IEEE Trans. Comput.,
vol. C-32, pp. 1099-1108, Dec. 1983.

[13] D. S. Parker and C. S. Raghavendra, "The Gamma network: A
multiprocessor interconnection network with redundant paths," in
Proc. 9th Annu. Symp. Comput. Architecture, Apr. 1982, pp. 73-80.

[14] D. S. Parker and C. S. Raghavendra, "The Gamma network," IEEE
Trans. Comput., vol. C-33, pp. 367-373, Apr. 1984.

[15] M. C. Pease, III, "The indirect binary n-cube microprocessor
array," IEEE Trans. Comput., vol. C-26, pp. 458-473, May 1977.

[16] H. J. Siegel, Interconnection Networks for Large-Scale Parallel
Processing: Theory and Case Studies, second ed. New York:
McGraw-Hill, 1990.

[17] H. J. Siegel and S. D. Smith, "Study of multistage SIMD
interconnection networks," in Proc. 5th Annu. Symp. Comput.
Architecture, Apr. 1978, pp. 223-229.

[18] H. J. Siegel and R. J. McMillen, "The multistage cube: A
versatile interconnection network," IEEE Comput. Mag., vol. 14,
pp. 65-76, Dec. 1981.

[19] A. Varma and C. S. Raghavendra, "On permutations passable by
the Gamma network," J. Parallel Distributed Comput., vol. 3, no. 1,
pp. 72-91, Mar. 1986.

[20] C.-L. Wu and T.-Y. Feng, "On a class of multistage
interconnection networks," IEEE Trans. Comput., vol. C-29,
pp. 694-702, Aug. 1980.

[21] C.-L. Wu and T.-Y. Feng, "The reverse-exchange interconnection
network," IEEE Trans. Comput., vol. C-29, pp. 801-811, Sept. 1980.

[22] D. Rau, J. A. B. Fortes, and H. J. Siegel, "Destination tag
routing techniques based on a state model for the IADM network,"
Tech. Rep. TR-EE 87-39, School of Electrical Engineering, Purdue
Univ., West Lafayette, IN 47907, Oct. 1987.


Darwen Rau received the Ph.D. degree in electrical
engineering from Purdue University, West
Lafayette, IN, in 1988, the M.S. degree in mechanical
engineering from the University of Massachusetts,
and the B.S. degree in industrial engineering
from National Tsing Hua University, Taiwan.

He is a member of the Technical Staff at AT&T
Bell Laboratories, Naperville, IL. Prior to joining
Bell Laboratories, he was a Senior Software Engineer
at Digital Equipment Corporation. His research
interests include interconnection networks, ISDN,
software maintenance, and multimedia.


Jose A. B. Fortes (S’80-M’84) for a photograph and biography, see the
February 1992 issue of this Transactions, p. 206.


Howard Jay Siegel (M’77-SM’82-F’90) received
two B.S. degrees from MIT, and the M.A., M.S.E.,
and Ph.D. degrees from Princeton University.

He is a Professor and Coordinator of the Parallel
Processing Laboratory in the School of Electrical
Engineering at Purdue University. He has coauthored
over 150 technical papers, coedited four
volumes, and written one book. He does consulting,
has prepared two “continuing education” eleven-hour
videotape courses, and has given tutorials
for various organizations in the U.S.A., Europe, and
Israel, all in the area of parallel computing.

Dr. Siegel has served as Program Co-Chair of the 1983 International
Conference on Parallel Processing and as General Chair of the 3rd
International Conference on Distributed Computing Systems and the 15th Annual
International Symposium on Computer Architecture. He is Coeditor-in-Chief
of the Journal of Parallel and Distributed Computing, and Program Chair
of Frontiers '92: The 4th Symposium on the Frontiers of Massively Parallel
Computations. His current research focuses on interconnection networks and
the use and design of the PASM reconfigurable parallel computer system (a
prototype of which is supporting active experimentation). He is a member of
the Eta Kappa Nu and Sigma Xi honorary societies.


REFERENCE  NO.  6 


Shang,  W.  and  Fortes,  J.  A.  B.,  "On  Mapping  Uniform  Dependence  Algorithms  into 
Lower Dimensional Processor Arrays," IEEE Transactions on Parallel and
Distributed Systems, Volume 3, Number 3, May 1992, pp. 350-363.

Note  -  Two-dimensional  processors  with  faulty  elements  can  be  easily  reconfigured 
as  one-dimensional  arrays.  However,  it  is  necessary  to  remap  the  original  algorithm 
into  an  array  of  smaller  dimension  than  that  of  the  original  array.  This  paper  presents 
a  mapping  methodology  which  addresses  this  problem. 




IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 3, NO. 3, MAY 1992


On  Time  Mapping  of  Uniform 
Dependence  Algorithms  into  Lower 
Dimensional  Processor  Arrays 

Weijia Shang, Member, IEEE, and Jose A. B. Fortes, Member, IEEE


Abstract - Most existing methods of mapping algorithms into
processor arrays are restricted to the case where n-dimensional
algorithms, or algorithms with n nested loops, are mapped into
(n - 1)-dimensional arrays. However, in practice, it is interesting
to map n-dimensional algorithms into (k - 1)-dimensional arrays
where k < n. For example, many algorithms at bit level are
at least four-dimensional (matrix multiplication, convolution, LU
decomposition, etc.) and most existing bit level processor arrays
are two-dimensional. A computational conflict occurs if two or
more computations of an algorithm are mapped into the same
processor and the same execution time. In this paper, based on
the Hermite normal form of the mapping matrix, necessary and
sufficient conditions are derived to identify mappings without
computational conflicts. These conditions are used to find time
mappings of n-dimensional algorithms into (k - 1)-dimensional
arrays, k < n, without computational conflicts. For some
applications, the mapping is time-optimal.

Index Terms - Bit level algorithm, conflict-free, nested loops,
optimal time mapping, processor array.


I.  Introduction 

MOST existing methods of mapping algorithms into
processor arrays are restricted to the cases where
n-dimensional algorithms, or algorithms with n nested loops, are
mapped into (n - 1)-dimensional processor arrays [2]-[13].
For example, the three-dimensional matrix multiplication
algorithm is usually mapped into a two-dimensional processor
array by these methods [10], [21], [32]. This paper considers
time mappings of n-dimensional algorithms into
(k - 1)-dimensional, k < n, processor arrays. Procedures are proposed
to find time mappings (space mappings are assumed to be
given) without computational conflicts, which means no two
or more computations of the algorithm are mapped into the
same processor and execution time.

Algorithms under consideration in this paper are nested
loops with regular data dependence structures. Such algorithms
can be modeled by a uniform dependence algorithm (J, D). Set
J is the index set, or iteration space. Each element in J is an
n-tuple integral column vector (called iteration vector or index
vector). Matrix D is the dependence matrix where each column
is a dependence vector. If computation in one iteration depends
on the computation in another iteration, this dependence is
represented by the vector difference of the iteration vectors
corresponding to these two iterations. All dependences are
assumed to be uniform, which means the dependence relation
exists between two iterations as long as the vector difference
between these two iterations is equal to the vector representing
that dependence relation. This algorithm model can be easily
related to similar models and concepts in [1]-[13], [19] and
several other works. Uniform dependence algorithms occur
frequently in digital signal and image processing and scientific
computing applications [36].

Manuscript received May 1, 1990; revised June 11, 1991. This work was
supported in part by Louisiana Education Quality Support Fund under Contract
LEQSF(1991-93)-RD-A-42, by the National Science Foundation under Grant
DCI-8419745, and by the Innovative Science and Technology Office of the
Strategic Defense Initiative Organization, and was administered through the
Office of Naval Research under Contracts 00014-85-k-0588, 00014-88-k-0723,
and 00014-90-J-1483.

W. Shang is with the Center for Advanced Computer Studies, University
of Southwestern Louisiana, Lafayette, LA 70504.

J. A. B. Fortes is with the School of Electrical Engineering, Purdue
University, West Lafayette, IN 47907.

IEEE Log Number 9107409.

Examples of two-dimensional bit level processor arrays
include GAPP [33], DAP [34], MPP [35], MP-1 [31], etc. Many
bit level algorithms are four- or five-dimensional, such as matrix
multiplication, convolution, LU decomposition, etc. How to
automatically map these algorithms into two-dimensional bit
level arrays is still a problem [28]. That is why in practice
it is interesting to develop a method to map n-dimensional
algorithms into (k - 1)-dimensional processor arrays with
k < n. The work reported in this paper was motivated by
the implementation of RAB (Reconfiguration Algorithm for
Bit level code) [26], an experimental tool which maps a class
of algorithms programmed in “C” into bit level arrays. In the
RAB approach, algorithms are first expanded into bit level,
and second, the dependence relations are analyzed and the
algorithm is uniformized. Then the global optimal solution,
which often maps a four- or five-dimensional bit level algorithm
into a two-dimensional bit level processor array, is to be found.

Several attempts have been made to map algorithms
into lower dimensional systolic arrays [15], [22], [23], [25].
In particular, important steps toward a formal solution to this
problem were made in [23]. Based on the Lamport hyperplane
transformation model [13], a procedure was proposed
to find mappings of three-dimensional algorithms into
one-dimensional, or linear, systolic arrays without computational
conflicts and data link collisions. Five conditions were given to
guarantee the correctness of the mapping. The first condition
ensures that dependence relations among different computations
of the algorithm are respected; the second condition is
about computational conflicts; the third and fourth conditions
deal with the number of shift registers on links and the data
travel directions; and the fifth condition is to avoid data
link collisions. The concept of data link collisions and the
conditions to avoid such collisions are introduced in that
work. Detection of computational conflicts is basically by a
heuristic analysis of all computations of the algorithm and the
optimality of the mapping is not guaranteed. A suboptimal
solution for the matrix multiplication algorithm was found in
[23] by which the total execution time is $(3 + \mu)\mu + 1$ for
multiplying two $(\mu + 1) \times (\mu + 1)$ matrices. In [22], further
results are reported in mapping n-dimensional algorithms into
(k - 1)-dimensional processor arrays. A suboptimal solution
for the reindexed transitive closure algorithm [17], [23] was
found by the proposed procedure in [22] by which the total
execution time is $\mu(2\mu + 3) + 1$ where $\mu$ is the problem size.

1045-9219/92$03.00 © 1992 IEEE

This paper describes a method of finding time mappings
of n-dimensional uniform dependence algorithms into
(k - 1)-dimensional arrays, k < n, without any computational
conflicts. Based on the Hermite normal form of the mapping
matrix, necessary and sufficient conditions are derived to
guarantee a conflict-free mapping. These conditions are used to
formulate the problem of finding time-optimal and conflict-free
mappings as an integer programming problem. The mapping is
optimal for some applications. Compared to the methods in [22]
and [23], the main contributions of this paper are the closed
form necessary and sufficient conditions for conflict-free
mappings based on the Hermite normal form of the mapping
matrix. In addition, by using these conditions, the problem
of identifying time-optimal and conflict-free mappings is
formulated as an integer programming optimization problem. For
some algorithms such as the matrix multiplication algorithm
and the transitive closure algorithm, the integer programming
formulation can be further converted to linear programming
problems. In Section V, the method proposed in this paper
is used to find optimal solutions for the matrix multiplication
algorithm and the reindexed transitive closure algorithm which
improve the total execution time of $(3 + \mu)\mu + 1$ in [23] and
$\mu(2\mu + 3) + 1$ in [22] to $(2 + \mu)\mu + 1$ and $\mu(\mu + 3) + 1$,
respectively, where $\mu$ is the problem size.

This  paper  is  organized  as  follows.  Section  II  presents 
basic  terminology  and  definitions,  introduces  the  concept  of 
computational  conflicts,  and  provides  statements  of  problems 
addressed  in  this  paper.  Section  III  discusses  a  simple  case 
to  illustrate  different  aspects  of,  and  provide  insight  into, 
the  conflict-free  mapping  problem.  Section  IV  discusses  the 
conflict-free  mapping  problem  in  general.  Section  V  presents 
an  optimization  procedure  and  integer  programming  problem 
formulations  which  can  be  used  to  find  mappings  without  any 
computational  conflicts.  Section  VI  concludes  this  paper  and 
points  out  some  future  work. 

II.  Terminology  and  Definitions 

Throughout this paper, sets, matrices, and row vectors are
denoted by capital letters, column vectors are represented by
lower case symbols with an overbar, and scalars correspond
to lower case letters. The transpose of a vector $\bar{v}$ is denoted
$\bar{v}^T$. The vector $\bar{0}$ denotes the row or column vector whose
entries are all zeroes. The dimensions of vector $\bar{0}$ and whether
it denotes a row or column vector are implied by the context
in which it is used. The symbol $I$ denotes the identity
matrix. The rank and the determinant of matrix $A$ are denoted
rank(A) and det A, respectively. The set of integers, the set
of nonnegative integers, and the set of positive integers are
denoted $Z$, $N$, and $N^+$, respectively. The empty set is denoted
$\emptyset$. The notations $|C|$ and $|a|$ represent the cardinality, or the
number of elements, of set $C$ and the absolute value of scalar
$a$, respectively. Let $\bar{v}$ and $\bar{u}$ be two vectors. Then $\bar{v} \ge \bar{u}$
means every component of $\bar{v}$ is greater than or equal to the
corresponding component of $\bar{u}$. Finally, if $x$ is an element of
a set $S$, the notation $x \in S$ is used and this notation is also
used to indicate that a column vector $\bar{m}_j$ (or row vector $M_i$)
is a column (row) of a matrix $M$, i.e., $\bar{m}_j \in M$ ($M_i \in M$)
means $\bar{m}_j$ ($M_i$) is a column (row) vector of matrix $M$.

Algorithms  of  interest  in  this  paper  are  the  so-called  uniform 
dependence  algorithms  defined  as  follows. 

Definition 2.1 (Uniform dependence algorithm): A uniform
dependence algorithm is an algorithm that can be described
by an equation of the form

$$ v(\bar{j}) = g_{\bar{j}}(v(\bar{j} - \bar{d}_1), v(\bar{j} - \bar{d}_2), \ldots, v(\bar{j} - \bar{d}_m)) \tag{2.1} $$

where

1) $\bar{j} \in J \subset Z^n$ is an index point (a column vector), J is
the index set, or iteration space, of the algorithm and
$n \in N^+$ is the algorithm dimension, or the number of
components of $\bar{j}$;

2) $g_{\bar{j}}$ is the computation indexed by $\bar{j}$, i.e., a single-valued
function computed “at point $\bar{j}$” in a single unit of time;

3) $v(\bar{j})$ is the value computed “at $\bar{j}$,” i.e., the result of
computing the right-hand side of (2.1); and

4) $\bar{d}_i \in Z^n$, $i = 1, \ldots, m$, $m \in N$, are dependence vectors,
also called dependencies, which are constant (i.e.,
independent of $\bar{j} \in J$); the matrix $D = [\bar{d}_1, \ldots, \bar{d}_m]$ is
called the dependence matrix.

The  class  of  uniform  dependence  algorithms  is  a  simple 
extension  of  the  class  of  algorithms  described  by  uniform 
recurrence equations [1]. The main difference is that uniform
dependence algorithms allow for different functions to
be  computed  (in  a  unit  of  time)  at  different  points  of  the 
index  set.  From  a  practical  viewpoint,  uniform  dependence 
algorithms  can  be  easily  related  to  programs  where  1)  a 
single  statement  appears  in  the  body  of  a  multiply  nested 
loop  and  2)  the  indexes  of  the  variable  in  the  left-hand  side 
of  the  statement  differ  by  a  constant  from  the  corresponding 
indexes  in  each  reference  to  the  same  variable  in  the  right- 
hand  side.  Alternative  computations  can  occur  in  each  iteration 
as  a  result  of  a  single  conditional  statement  as  long  as  data 
dependencies  do  not  change.  Nested  loop  programs  with 
multiple  statements  can  also  use  the  techniques  of  this  paper 
together  with  the  alignment  method  discussed  in  [14]  and  [24]. 
Uniform  dependence  algorithms  occur  frequently  in  scientific 
computing  and  digital  signal  processing  applications. 
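
As a rough illustration of the structural information (J, D), the following sketch enumerates a constant-bounded index set and lists, for a given index point, the points it depends on. The helper names, the bounds, and the choice of the identity matrix as D (a common uniformized form of three-loop matrix multiplication) are assumptions made for this example and are not taken from the paper.

```python
from itertools import product

def index_set(mu):
    """Constant-bounded index set: all integer points with 0 <= j_i <= mu_i."""
    return list(product(*(range(m + 1) for m in mu)))

def dependence_sources(j, D, J_set):
    """Index points j - d_i (d_i a column of D) that lie inside the index set."""
    sources = []
    for d in zip(*D):                      # iterate over the columns of D
        p = tuple(a - b for a, b in zip(j, d))
        if p in J_set:
            sources.append(p)
    return sources

# Assumed example: indices (i, j, k) of a uniformized matrix multiplication.
D = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
mu = (2, 2, 2)
J = index_set(mu)
J_set = set(J)
print(len(J))                                    # 27
print(dependence_sources((1, 0, 2), D, J_set))   # [(0, 0, 2), (1, 0, 1)]
```

Points on the boundary of the index set have fewer sources because some of the points j - d_i fall outside J.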

For the purpose of finding time-optimal and conflict-free
mappings, only structural information of the algorithm, i.e.,
the index set J and the dependence matrix D, is needed.
Therefore, a uniform dependence algorithm with index set
J and dependence matrix D is herein characterized simply
by the pair (J, D). Each index vector $\bar{j} \in J$ corresponds to
a computation; and computation $\bar{j}$ depends on computations
$\bar{j} - \bar{d}_i \in J$, $i = 1, \ldots, m$. It is assumed that, as in Definition
2.1, the letters n and m always denote the algorithm dimension
and the number of dependence vectors, respectively.

Many  models  have  been  proposed  to  map  algorithms  into 
processor  arrays.  The  linear  algorithm  transformation  method 
proposed  in  [2],  [12],  and  [32]  is  used  in  this  paper  and  stated 
as  follows. 

Definition 2.2 (Linear algorithm transformation): A linear
algorithm transformation maps an n-dimensional uniform
dependence algorithm into a (k - 1)-dimensional processor array
according to the mapping:

$$ \tau : J \to Z^k, \quad \tau(\bar{j}) = T\bar{j}, \quad \forall \bar{j} \in J \tag{2.2} $$

where $T = \begin{bmatrix} S \\ \Pi \end{bmatrix} \in Z^{k \times n}$ is the mapping matrix;
$S \in Z^{(k-1) \times n}$ is the space mapping matrix, and $\Pi \in Z^{1 \times n}$
is the time mapping vector, or linear schedule vector. The
computation indexed by $\bar{j} \in J$ is executed at time $\Pi\bar{j}$ and
at processor $S\bar{j}$. The mapping $\tau$ must satisfy the following
conditions:

1) $\Pi D > \bar{0}$.

2) $SD = PK$ where $P \in Z^{(k-1) \times r}$ is the matrix of
interconnection primitives of the target machine, and $K \in
Z^{r \times m}$ is such that

$$ \sum_{j=1}^{r} k_{ji} \le \Pi \bar{d}_i, \quad i = 1, \ldots, m. \tag{2.3} $$

3) $\forall \bar{j}_1, \bar{j}_2 \in J$, if $\bar{j}_1 \ne \bar{j}_2$, then $\tau(\bar{j}_1) \ne \tau(\bar{j}_2)$, or
$T\bar{j}_1 \ne T\bar{j}_2$.

4) The rank of T is equal to k, or rank(T) = k.

Condition 1 in Definition 2.2 preserves the partial ordering
induced by the dependence vectors. If this condition is
satisfied, then the computation indexed by $\bar{j} \in J$ is scheduled to
execute only after the executions of computations indexed by
$\bar{j} - \bar{d}_i \in J$, $i = 1, \ldots, m$, because $\Pi D > \bar{0}$, and therefore the
dependence relation is respected.

The matrix of interconnection primitives P describes the
connection links of processors in the array. For an array with
each processor connected to its four nearest east, south, west,
and north neighbors, it has four interconnection primitives
$[0, 1]^T$, $[0, -1]^T$, $[1, 0]^T$, and $[-1, 0]^T$, and matrix
$P = \begin{bmatrix} 0 & 0 & 1 & -1 \\ 1 & -1 & 0 & 0 \end{bmatrix}$.
Condition 2 in Definition 2.2 guarantees
that the space mapping can be implemented in a fixed systolic
architecture with interconnection primitive matrix P. The
summation in the left-hand side of the inequality in (2.3) is
the number of times interconnection primitives are used
to pass the datum caused by the dependence vector $\bar{d}_i$ from
the source to the destination. The term on the right is the number
of time units between the source usage and the destination usage of
that datum. Assuming it takes one time unit for a datum to
travel one interconnection primitive, the inequality must be
satisfied to have the datum arrive before it is used. Condition
2 in Definition 2.2 may not be required when a new processor
array is designed specially for the algorithm. It is required only
when the algorithm is to be mapped into a processor array with
a fixed interconnection structure.
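
Condition 2 can be checked mechanically once P and K are known. The following is a minimal sketch, not the paper's procedure: the helper names, the space mapping S (projection of a three-dimensional algorithm along its third axis), the dependence matrix D = I, and the matrix K are all assumptions chosen for illustration; only P is the four-neighbor matrix described in the text.

```python
def matmul(A, B):
    """Product of two integer matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def satisfies_condition_2(S, D, P, K, Pi):
    """Check SD = PK and, for each i, sum_j k_ji <= Pi * d_i (inequality (2.3))."""
    if matmul(S, D) != matmul(P, K):
        return False
    travel_time = matmul([list(Pi)], D)[0]     # Pi * d_i for each column d_i
    link_uses = [sum(col) for col in zip(*K)]  # column sums of K
    return all(u <= t for u, t in zip(link_uses, travel_time))

# Interconnection primitives of the four-neighbor array, as columns of P.
P = [[0, 0, 1, -1],
     [1, -1, 0, 0]]
# Assumed example data (hypothetical, for illustration only).
S = [[1, 0, 0],
     [0, 1, 0]]
D = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1]]
K = [[0, 1, 0],
     [0, 0, 0],
     [1, 0, 0],
     [0, 0, 0]]
print(satisfies_condition_2(S, D, P, K, (1, 1, 1)))   # True
print(satisfies_condition_2(S, D, P, K, (0, 1, 1)))   # False
```

In the second call the first dependence needs one link traversal but is allotted zero time units, so inequality (2.3) fails.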

Condition 3 is for avoiding computational conflicts because
if $\tau(\bar{j}_1) = \tau(\bar{j}_2)$, then the computations indexed by $\bar{j}_1$
and $\bar{j}_2$ are mapped into the same processor and time and a
conflict occurs. Condition 4 guarantees that the algorithm is
to be mapped into a (k - 1)-dimensional array but not a
q-dimensional array, q < k - 1. When rank(T) = q + 1 < k,
there are exactly q + 1 linearly independent rows in T, and all
other rows of T are linear combinations of these q + 1 linearly
independent rows. Let T' be the matrix consisting of these
q + 1 linearly independent rows; then T can be transformed
linearly to T', which means the algorithm is actually mapped
into a q-dimensional processor array.
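
Conditions 1, 3, and 4 of Definition 2.2 can be verified by direct enumeration on small index sets. The sketch below is an illustrative brute-force check rather than the method developed in this paper; the function names and the example mapping (S = [1, 1, -1], D = I, bounds of 3) are assumptions introduced here.

```python
from fractions import Fraction
from itertools import product

def matvec(M, v):
    """Image of the index point v under the integer matrix M."""
    return tuple(sum(a * b for a, b in zip(row, v)) for row in M)

def rank(M):
    """Rank of an integer matrix by Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in M]
    r = 0
    for c in range(len(rows[0])):
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            f = rows[i][c] / rows[r][c]
            rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        r += 1
        if r == len(rows):
            break
    return r

def check_mapping(S, Pi, D, mu):
    """Return (cond1, cond3, cond4) of Definition 2.2 for T = [S; Pi]."""
    T = [list(row) for row in S] + [list(Pi)]
    cond1 = all(sum(p * d for p, d in zip(Pi, col)) > 0 for col in zip(*D))
    images = [matvec(T, j) for j in product(*(range(m + 1) for m in mu))]
    cond3 = len(set(images)) == len(images)   # no two points share an image
    cond4 = rank(T) == len(T)
    return cond1, cond3, cond4

# Assumed example: a 3-D algorithm mapped onto a linear array.
D = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(check_mapping([[1, 1, -1]], (1, 1, 1), D, (3, 3, 3)))  # (True, False, True)
print(check_mapping([[1, 1, -1]], (1, 2, 2), D, (3, 3, 3)))  # (True, True, True)
```

The first schedule fails Condition 3 because the index points [1, 0, 0] and [0, 1, 0] receive the same processor and time.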

More  constraints  on  the  mapping  r  are  possible  for  some 
implementation  requirements.  In  addition,  different  constraint 
forms  from  those  in  Definition  2.2  for  the  same  implementation 
requirement  can  be  used.  For  example,  in  [23],  the  inequality 
in  (2.3)  is  required  to  be  an  equality  which  means  data  must 
arrive  right  at  the  time  of  their  usage  and  are  not  allowed 
to  arrive  before  the  usage.  Also,  in  [23],  constraints  to  avoid 
data  link  collisions  are  considered. 

Because the execution of any computation needs one time
unit as defined in Definition 2.1, the total execution time by
the linear schedule vector $\Pi$ is as follows:

$$ t = \max\{\Pi(\bar{j}_1 - \bar{j}_2) : \bar{j}_1, \bar{j}_2 \in J\} + 1. \tag{2.4} $$

For a class of practical algorithms, the loop bounds are
constants. This kind of algorithm is characterized by the
constant-bounded index set defined as

$$ J = \{[j_1, \ldots, j_n]^T : 0 \le j_i \le \mu_i,\ j_i \in Z,\ \mu_i \in N^+,\ i = 1, \ldots, n\} \tag{2.5} $$

where zero and $\mu_i$ correspond to the lower and upper bounds
of the ith loop, respectively. Upper bounds $\mu_i$, $i = 1, \ldots, n$,
are called problem size variables. To simplify the problem,
this paper is restricted to the algorithms with constant-bounded
index sets. This assumption is summarized as follows.

Assumption 2.1: In this paper, the index sets under consideration
are assumed to be constant-bounded, as defined formally
by (2.5).

Some other kinds of algorithms can be transformed into
algorithms with constant-bounded index sets by a linear mapping
of the index sets [12]. For an algorithm with a constant-bounded
index set, because

$$ \max\{\Pi(\bar{j}_1 - \bar{j}_2) : \bar{j}_1, \bar{j}_2 \in J\} = \sum_{i=1}^{n} |\pi_i| (\mu_i - 0), \tag{2.6} $$

the total execution time t in (2.4) can be simplified to

$$ t = 1 + \sum_{i=1}^{n} |\pi_i| \mu_i. \tag{2.7} $$

From (2.7), the vector $\Pi$ which minimizes the objective
function t in (2.7) is such that the absolute values of its entries
$|\pi_i|$, $i = 1, \ldots, n$, are as small as possible subject to the
mapping constraints being satisfied. In other words, if the absolute
value of any one of the entries of the optimal $\Pi$ is reduced by one, then
the resulting vector is not a valid linear schedule vector. This
conclusion is also indicated in [10] and [11] and is summarized
in the following theorem.

Theorem 2.1 [10], [11]: The total execution time described
in (2.4) is a monotonically increasing function of $|\pi_i|$, $i =
1, \ldots, n$, the absolute values of entries of vector $\Pi$.
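
The agreement between (2.4) and the closed form (2.7) on a constant-bounded index set can be confirmed numerically. The short sketch below is only a sanity check; the function names and the sample values of the schedule vector and of the bounds are arbitrary assumptions.

```python
from itertools import product

def total_time_bruteforce(Pi, mu):
    """Equation (2.4): 1 + max of Pi*(j1 - j2) over all pairs of points in J."""
    values = [sum(p * x for p, x in zip(Pi, j))
              for j in product(*(range(m + 1) for m in mu))]
    return max(values) - min(values) + 1

def total_time_closed_form(Pi, mu):
    """Equation (2.7): 1 + sum of |pi_i| * mu_i."""
    return 1 + sum(abs(p) * m for p, m in zip(Pi, mu))

Pi, mu = (1, -2, 3), (4, 3, 2)
print(total_time_bruteforce(Pi, mu), total_time_closed_form(Pi, mu))  # 17 17
```

Reducing any |pi_i| by one lowers the closed-form value by mu_i, which is the monotonicity stated in Theorem 2.1.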

A conflict occurs if two or more computations are mapped
into the same processor and the same execution time. That is,
for two distinct index points $\bar{j}_1, \bar{j}_2 \in J$, if $T\bar{j}_1 = T\bar{j}_2$, then
there is a conflict. For the case where k = n, or T is a square
matrix, rank(T) = n guarantees a conflict-free mapping
because $T\bar{j}_1 = T\bar{j}_2$ if and only if $\bar{j}_1 = \bar{j}_2$. For the case where
k < n, even when rank(T) = k, or matrix T has full row
rank, there is at least one nonzero vector $\bar{\gamma}$ such that $T\bar{\gamma} = \bar{0}$.
Let $\bar{j}_1 = \bar{j}_2 + \bar{\gamma}$; then $T\bar{j}_1 = T\bar{j}_2$. If both $\bar{j}_1$ and $\bar{j}_2$ belong
to the index set, then the computations indexed by $\bar{j}_1$ and $\bar{j}_2$,
respectively, are mapped to the same processor and the same
execution time and a conflict occurs. Therefore, it is much
more difficult to find a mapping without conflicts when k < n
than for the case when k = n.

One possible way to avoid conflicts is to find the mapping
matrix T such that, for any arbitrary index point $\bar{j} \in J$ and
any $\bar{\gamma}$ that is a nonzero integral solution of equation $T\bar{\gamma} = \bar{0}$,
$\bar{j} + \bar{\gamma}$ does not belong to the index set J. This concept is
illustrated by Fig. 1 which shows a two-dimensional index
set $J = \{[j_1, j_2]^T : 0 \le j_1, j_2 \le 4,\ j_1, j_2 \in Z\}$. If $\bar{\gamma}$ is
$\bar{\gamma}_1 = [1, 1]^T$, then index points $\bar{j} = \bar{0}$ and $\bar{j} + \bar{\gamma}_1 = [1, 1]^T$
both belong to index set J and computations indexed by
$[0, 0]^T, [1, 1]^T, [2, 2]^T, \ldots, [4, 4]^T$ will be mapped into the
same processor and the same execution time. Therefore, there
is at least one conflict. However, if $\bar{\gamma}$ is $\bar{\gamma}_2 = [3, 5]^T$, there will
be no conflict at all because for any arbitrary $\bar{j} \in J$, $\bar{j} + \bar{\gamma}_2 \notin J$.
Intuitively, if vector $[3, 5]^T$ is drawn with one end at $[0, 0]^T$
(or at any other index point of the index set), then the other
end is out of the index set and vector $[3, 5]^T$ does not meet any
integer points in the index set. Therefore, the mapping with
this $\bar{\gamma}$ is conflict-free. To describe these concepts formally, the
following definitions are introduced.

Definition 2.3 (Conflict vector, feasible and nonfeasible conflict
vectors, and conflict-free mapping matrix): Given an
algorithm (J, D) and a mapping matrix $T \in Z^{k \times n}$, an
integral column vector $\bar{\gamma} = [\gamma_1, \ldots, \gamma_n]^T$ is a conflict vector
of the mapping matrix T if and only if $T\bar{\gamma} = \bar{0}$ and
$\gcd(\gamma_1, \ldots, \gamma_n) = 1$ (footnote 1). If for any arbitrary index point
$\bar{j} \in J$, $\bar{j} + \bar{\gamma} \notin J$, then $\bar{\gamma}$ is a feasible conflict vector. If there
exists at least one index point $\bar{j} \in J$ such that $\bar{j} + \bar{\gamma} \in J$,
then $\bar{\gamma}$ is called a nonfeasible conflict vector. If all the conflict
vectors are feasible, then this mapping matrix T is conflict-free.
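
Definition 2.3 translates directly into two small predicates. The following sketch applies them to the two vectors of Fig. 1; the function names and the 1 x 2 matrix T used in the last line are hypothetical and serve only to exercise the gcd requirement.

```python
from functools import reduce
from itertools import product
from math import gcd

def is_conflict_vector(T, gamma):
    """T * gamma = 0 and the entries of gamma are relatively prime."""
    in_null_space = all(sum(t * g for t, g in zip(row, gamma)) == 0 for row in T)
    return in_null_space and reduce(gcd, (abs(g) for g in gamma)) == 1

def is_feasible(gamma, mu):
    """True if j + gamma leaves the index set for every index point j."""
    for j in product(*(range(m + 1) for m in mu)):
        shifted = [a + g for a, g in zip(j, gamma)]
        if all(0 <= s <= m for s, m in zip(shifted, mu)):
            return False
    return True

mu = (4, 4)                       # index set of Fig. 1
print(is_feasible((1, 1), mu))    # False
print(is_feasible((3, 5), mu))    # True
T = [[1, -1]]                     # hypothetical mapping with T * [1, 1]^T = 0
print(is_conflict_vector(T, (1, 1)), is_conflict_vector(T, (2, 2)))  # True False
```

The vector [2, 2]^T solves the same equation but is excluded because its entries share the common divisor 2.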

Fig. 1. Nonfeasible conflict vector $\bar{\gamma}_1$ and feasible conflict vector $\bar{\gamma}_2$. Vector
$\bar{\gamma}_2$ does not meet any integral points inside the index set.

Footnote 1: $\gcd(a_1, \ldots, a_n)$ denotes the greatest common divisor of integers
$a_1, \ldots, a_n$.

Example 2.1: Consider a four-dimensional algorithm (J, D)
where

$$ J = \{\bar{j} : \bar{j} \in Z^4,\ 0 \le j_i \le 6,\ i = 1, \ldots, 4\}. $$

Assume that this algorithm is to be mapped into a
one-dimensional, or linear, processor array and one possible
mapping matrix is


r  = 


1  r 
1  7 


(2.8) 


Consider the following solutions of $T\bar{\gamma} = \bar{0}$: $\bar{\gamma}_1 =
[0, 1, -7, 0]^T$, $\bar{\gamma}_2 = [7, -1, 0, 0]^T$ and $\bar{\gamma}_3 = [1, 0, -1, 0]^T$.
Clearly, $T\bar{\gamma}_1 = T\bar{\gamma}_2 = T\bar{\gamma}_3 = \bar{0}$ and the greatest common
divisors of their entries are unity. So $\bar{\gamma}_1$, $\bar{\gamma}_2$, and $\bar{\gamma}_3$ are conflict
vectors of mapping matrix T. However, vector $[2, 0, -2, 0]^T$
is also a solution of equation $T\bar{\gamma} = \bar{0}$ but is not a conflict
vector of mapping matrix T because the greatest common
divisor of its entries is not unity. Conflict vectors $\bar{\gamma}_1$ and $\bar{\gamma}_2$
are feasible because it can be checked that for any arbitrary
index point $\bar{j} \in J$, $\bar{j} + \bar{\gamma}_i \notin J$, $i = 1, 2$. Conflict vector $\bar{\gamma}_3$ is
not feasible because for the index point $\bar{j} = [0, 0, 1, 0]^T \in J$,
$\bar{j} + \bar{\gamma}_3 = [1, 0, 0, 0]^T \in J$. Therefore, T is not conflict-free. □

For algorithms with constant-bounded index sets, the following
theorem describes the common characteristics of feasible
conflict vectors.

Theorem 2.2: For algorithms with constant-bounded index
sets defined by (2.5), a mapping matrix T is conflict-free
if and only if for each of its conflict vectors $\bar{\gamma} =
[\gamma_1, \ldots, \gamma_i, \ldots, \gamma_n]^T$ there exists an entry $\gamma_i$ such that
$|\gamma_i| > \mu_i$.

Proof: ($\Rightarrow$). Because T is conflict-free, all the conflict
vectors of T are feasible. Now suppose that $\bar{\gamma}$ is a conflict
vector of T and $|\gamma_i| \le \mu_i$, $i = 1, \ldots, n$. Consider the index
point $\bar{j} = [j_1, \ldots, j_n]^T$ where $j_i = 0$ if $\gamma_i \ge 0$ and $j_i = -\gamma_i$
if $\gamma_i < 0$. Both $\bar{j}$ and $\bar{j} + \bar{\gamma}$ belong to the index set J defined
by (2.5) because $|\gamma_i| \le \mu_i$, $i = 1, \ldots, n$. By Definition 2.3, $\bar{\gamma}$
is not feasible, which is contrary to the assumption. Therefore,
for each of the conflict vectors $\bar{\gamma}$, there must exist an entry
$\gamma_i$ such that $|\gamma_i| > \mu_i$.

($\Leftarrow$). Let $\bar{\gamma}$ be a conflict vector of mapping matrix T and
consider an arbitrary index point $\bar{j}$ belonging to the index set
defined by (2.5). Let $\bar{j}' = \bar{j} + \bar{\gamma} = [j'_1, \ldots, j'_n]^T$. Because there
exists an entry $\gamma_i$ of $\bar{\gamma}$ such that $|\gamma_i| > \mu_i$ and $\mu_i \ge j_i \ge 0$,
$j'_i = j_i + \gamma_i > \mu_i$ if $\gamma_i > 0$ and $j'_i = j_i + \gamma_i < 0$ if $\gamma_i < 0$. In
both cases, $\bar{j}'$ is not in the index set J and $\bar{\gamma}$ is feasible. This
implies that T is conflict-free. □
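
Theorem 2.2 replaces the enumeration over the index set by a componentwise comparison. The sketch below, with function names chosen here for illustration, cross-checks the two tests on the three conflict vectors and the index set of Example 2.1.

```python
from itertools import product

def feasible_by_theorem(gamma, mu):
    """Theorem 2.2 test: some entry satisfies |gamma_i| > mu_i."""
    return any(abs(g) > m for g, m in zip(gamma, mu))

def feasible_by_enumeration(gamma, mu):
    """Definition 2.3 test: j + gamma lies outside J for every j in J."""
    return not any(
        all(0 <= a + g <= m for a, g, m in zip(j, gamma, mu))
        for j in product(*(range(m + 1) for m in mu))
    )

mu = (6, 6, 6, 6)   # index set of Example 2.1
for gamma in [(0, 1, -7, 0), (7, -1, 0, 0), (1, 0, -1, 0)]:
    print(gamma, feasible_by_theorem(gamma, mu), feasible_by_enumeration(gamma, mu))
```

The first two vectors are reported feasible by both tests and the third nonfeasible, in agreement with Example 2.1; the theorem-based test needs n comparisons instead of a pass over all index points.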

In practice, it is interesting to find optimal conflict-free
mappings with respect to different criteria and based on
different assumptions. To achieve this, one has to identify
first all feasible mappings that are conflict-free. Then it is
possible to choose an optimal one with respect to a certain
criterion from these conflict-free mappings. The criterion could
be the total execution time for the algorithm, the VLSI area
taken to implement this algorithm including the number of
processors and the length of the wiring, or the combination
of the total execution time and the VLSI area. Two problems
are addressed in this paper and are formulated below. The first
is about identifying conflict-free mappings and the second is
about finding optimal time mappings where space mappings
are given.

Problem 2.1 (Conflict-free mapping problem): Given an
n-dimensional uniform dependence algorithm and a
(k - 1)-dimensional processor array, find necessary and sufficient
conditions for mapping matrix $T \in Z^{k \times n}$ to be conflict-free.

Problem 2.2 (Time-optimal and conflict-free mapping problem):
Given an n-dimensional uniform dependence algorithm
(J, D) and a space mapping matrix $S \in Z^{(k-1) \times n}$, find an
integral row vector $\Pi^o \in Z^{1 \times n}$ which minimizes

$$ f = \max\{\Pi(\bar{j}_1 - \bar{j}_2) : \bar{j}_1, \bar{j}_2 \in J\} $$

subject to

$$ \begin{cases} \Pi D > \bar{0} \\ \sum_{j=1}^{r} k_{ji} \le \Pi \bar{d}_i, \text{ where } SD = PK \\ \operatorname{rank}(T) = k \\ T = \begin{bmatrix} S \\ \Pi \end{bmatrix} \text{ is conflict-free.} \end{cases} $$

In Problem 2.2, the objective function $f$ differs by one from the total execution time $t$ in (2.4). Clearly, $f$ is minimized if and only if $t$ is minimized. $P$ and $K$ are as defined in Definition 2.2. The solution of a special case of Problem 2.1 is discussed in Section III, and the general case is discussed in Section IV followed by the discussion of Problem 2.2.
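For a box-shaped index set the objective of Problem 2.2 has a closed form: the maximum of $\Pi(j_1 - j_2)$ over $J$ equals $\sum_i |\pi_i|\mu_i$, the expression minimized in (5.1). The sketch below checks this by enumeration, assuming $J = \{j : 0 \le j_i \le \mu_i\}$; the vector `pi` and the bounds `mu` are made-up values used only for illustration.

```python
from itertools import product

def objective_bruteforce(pi, mu):
    """max of pi . (j1 - j2) over all pairs of points in the box 0 <= j_i <= mu_i."""
    points = product(*[range(m + 1) for m in mu])
    values = [sum(p * a for p, a in zip(pi, j)) for j in points]
    return max(values) - min(values)

def objective_closed_form(pi, mu):
    """Sum of |pi_i| * mu_i, the closed form of the same maximum."""
    return sum(abs(p) * m for p, m in zip(pi, mu))

pi, mu = [1, -2, 3], [3, 2, 4]  # hypothetical schedule vector and bounds
print(objective_bruteforce(pi, mu), objective_closed_form(pi, mu))
```

The maximum is attained by taking $j_1$ at the upper bound wherever $\pi_i > 0$ and at zero elsewhere, with $j_2$ the opposite corner, which is why only the absolute values of the entries matter.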

III. Necessary and Sufficient Conditions for
Conflict-Free Mapping Matrix $T \in Z^{(n-1) \times n}$

This section discusses the solution of Problem 2.1 for the special case $k = n - 1$, that is, how to identify all conflict-free mapping matrices $T \in Z^{(n-1) \times n}$ that map $n$-dimensional algorithms into $(n-2)$-dimensional processor arrays. This simplest case can illustrate and give an intuitive understanding of different aspects of the conflict-free mapping problem so the reader can follow the general discussion in the next section more easily. Practical applications are the mapping of the four-dimensional convolution algorithm at bit level [26] into a two-dimensional systolic array and the mapping of the three-dimensional matrix multiplication algorithm into a linear systolic array [23].


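Let $\Pi \in Z^{1 \times n}$, $S \in Z^{(n-2) \times n}$, and $\mathrm{rank}(S) = n - 2$. Consider the following equation:

\[ T\gamma = 0 \quad \text{or} \quad \begin{bmatrix} S \\ \Pi \end{bmatrix} \gamma = 0. \tag{3.1} \]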


Let us first assume that $\mathrm{rank}(T) = n - 1$. Later in this section, conditions on $\Pi$ are given to guarantee that $\mathrm{rank}(T) = n - 1$. Clearly, there is only one linearly independent solution of (3.1). Without loss of generality, let $T = [B, b]$ where $B$ contains the first $n - 1$ columns of matrix $T$, $\mathrm{rank}(B) = n - 1$, and $b$ is the last column of $T$. Also, let $B^*$ and $\det B$ be the adjugate matrix and determinant of matrix $B$, respectively [18, p. 170]. Then all solutions of (3.1) can be expressed as

\[ \gamma = \lambda \begin{bmatrix} B^* b \\ -\det B \end{bmatrix} = \lambda \begin{bmatrix} f_1(\pi_1, \ldots, \pi_n) \\ \vdots \\ f_n(\pi_1, \ldots, \pi_n) \end{bmatrix} \tag{3.2} \]

where $\lambda$ is a constant.

If the first nonzero entry of a conflict vector is assumed to be positive (this implies no loss of generality), then for the mapping matrix $T \in Z^{(n-1) \times n}$, there is only one unique conflict vector (otherwise, $-\gamma$ would also be a conflict vector). This unique conflict vector $\gamma$ is expressed by (3.2) where $\lambda$ is such that $\gamma$ is integral, its entries are relatively prime, and the first nonzero entry is positive. According to Theorem 2.2, if this unique conflict vector is feasible, then the corresponding mapping is conflict-free. In addition, if $\Pi$ is such that there exists a nonzero entry $f_i(\pi_1, \ldots, \pi_n)$, $1 \le i \le n$, then $\mathrm{rank}(T) = n - 1$ because $f_i(\pi_1, \ldots, \pi_n)$ is, up to sign, the determinant of the submatrix of $T$ consisting of all columns except the $i$th one of $T$. These facts are summarized in the following theorem.

Theorem 3.1 (Necessary and sufficient condition 1): Let $\gamma$ be defined in (3.2) where the constant $\lambda$ in (3.2) is such that $\gamma$ is integral, its entries are relatively prime, and the first nonzero entry is positive. Then mapping matrix $T \in Z^{(n-1) \times n}$ is feasible if and only if vector $\gamma$ is feasible. The rank of matrix $T$ is $n - 1$ if and only if there exists a nonzero entry $f_i(\pi_1, \ldots, \pi_n)$, $1 \le i \le n$.

Proof: First, it is shown as follows that there is only one conflict vector if the first nonzero entry of the conflict vector is assumed to be positive. Suppose there are two conflict vectors $\gamma_1$ and $\gamma_2$ whose first nonzero entries are positive. Because there is only one linearly independent solution of (3.2), $\gamma_1$ and $\gamma_2$ are linearly dependent. Thus, $\gamma_2 = c\gamma_1$, where $c$ is a constant. If $c = 1$, then $\gamma_1 = \gamma_2$; if $c = -1$, then the first nonzero entry of one of the vectors is not positive; if $c$ is a nonintegral rational number, then $\gamma_2$ is nonintegral because the greatest common divisor of entries of $\gamma_1$ is one; and if $c > 1$ is integral, the greatest common divisor of the entries of $\gamma_2$ is greater than unity. Therefore, in all the cases discussed above, $\gamma_2$ is not a distinct conflict vector whose first nonzero entry is positive. So there is only one such conflict vector of mapping matrix $T$ and, therefore, $T$ is feasible if and only if $\gamma$ is feasible.

It is trivial to show that if $\mathrm{rank}(T) = n - 1$, then there is a nonzero entry $f_i(\pi_1, \ldots, \pi_n)$ because otherwise, $\mathrm{rank}(T) < n - 1$. Now suppose there exists a nonzero entry $f_i(\pi_1, \ldots, \pi_n)$.


SHANG AND FORTES: TIME MAPPING OF UNIFORM DEPENDENCE ALGORITHMS INTO LOWER DIMENSIONAL PROCESSOR ARRAYS




Let

\[ B^* = \begin{bmatrix} B_{11} & B_{21} & \cdots & B_{(n-1),1} \\ B_{12} & B_{22} & \cdots & B_{(n-1),2} \\ \vdots & \vdots & & \vdots \\ B_{1,(n-1)} & B_{2,(n-1)} & \cdots & B_{(n-1),(n-1)} \end{bmatrix} \tag{3.3} \]

where $B_{ij}$, $i, j = 1, \ldots, n - 1$, are the cofactors of matrix $B$ [18, p. 165]. Clearly, $f_i = [B_{1i}, \ldots, B_{(n-1),i}]\,b = B_{1i}s_{1,n} + \cdots + B_{(n-2),i}s_{(n-2),n} + B_{(n-1),i}\pi_n$, $1 \le i \le (n - 1)$. With little thought, it can be seen that $f_i$ is the determinant of matrix $B$ with the $i$th column being replaced by $b$ [18, p. 165] which is a submatrix of $T$. Therefore, there is a submatrix of $T$ whose determinant is nonzero which means $\mathrm{rank}(T) = n - 1$. □

According to Theorem 2.2, the unique conflict vector $\gamma$ in (3.2) is feasible if and only if the absolute value of one of its entries is greater than a certain value. Therefore, given a mapping matrix $T$, to see if it is conflict-free or not, (3.2) can be used.² Later in Section V, (3.2) is used to formulate Problem 2.2 as an integer programming problem. If functions $f_i$ in (3.2), $i = 1, \ldots, n$, are linear, then the formulation is possibly an integer linear programming problem. In the following, it is shown that if the space mapping matrix $S$ is given, functions $f_i$ in (3.2), $i = 1, \ldots, n$, are linear functions of $\pi_j$, $j = 1, \ldots, n$.

Proposition 3.2: Functions $f_i$, $i = 1, \ldots, n$ in (3.2) are linear functions of $\pi_j$, $j = 1, \ldots, n$.

Proof: Let $B^*$ be defined as in (3.3). Clearly, $f_i = [B_{1i}, \ldots, B_{(n-1),i}]\,b = B_{1i}s_{1,n} + \cdots + B_{(n-2),i}s_{(n-2),n} + B_{(n-1),i}\pi_n$, $1 \le i \le (n - 1)$. Cofactors $B_{li}$, $1 \le l \le n - 2$, are linear functions of $\pi_j$, $j = 1, \ldots, n - 1$ because $B_{li}$ is the determinant of the submatrix of $B$ obtained from removing the $l$th row and the $i$th column of matrix $B$. Thus, $B_{li}s_{l,n}$ are linear functions of $\pi_j$, $j = 1, \ldots, n - 1$. Cofactor $B_{(n-1),i}$ is independent of $\pi_j$, $j = 1, \ldots, n$ because it is the determinant of the submatrix of $B$ obtained by removing the $i$th column and the $(n - 1)$th row which is $[\pi_1, \ldots, \pi_{n-1}]$. Thus, $B_{(n-1),i}\pi_n$ is a linear function of $\pi_n$. Therefore, $f_i$, $i = 1, \ldots, n - 1$ are linear functions of $\pi_j$, $j = 1, \ldots, n$. The last entry $f_n = -\det B$ is also a linear function of $\pi_j$, $j = 1, \ldots, n - 1$ because it is, up to sign, the determinant of matrix $B$. □

Example 3.1: Consider algorithm $(J, D)$ used in [23] where dependence matrix $D$ and index set $J$ are as follows.

\[ D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}, \qquad J = \{ j : j \in Z^3,\ 0 \le j_i \le \mu,\ i = 1, \ldots, 3 \}. \tag{3.4} \]

Actually, this uniform dependence algorithm can model the matrix multiplication algorithm. How the dependence matrix and the index set are derived from the Fortran code of the matrix multiplication algorithm is shown in [21], [23], and 2|. Let the matrix multiplication algorithm compute $C = AB$, where $C = [c_{ij}]$, $A = [a_{ij}]$, and $B = [b_{ij}]$. By [23], dependence vectors $d_1$, $d_2$, and $d_3$ are induced by $B$, $A$, and $C$, respectively. In other words, the computation indexed by $j$ needs data $A$ from index point $j - d_2$, $B$ from index point $j - d_1$, and $C$ from index point $j - d_3$. If the space mapping matrix is chosen as the one used in [23], $S = [1, 1, -1]$, then mapping matrix $T$ and its conflict vector $\gamma$ are as follows.

\[ T = \begin{bmatrix} 1 & 1 & -1 \\ \pi_1 & \pi_2 & \pi_3 \end{bmatrix}, \qquad \gamma = \lambda \begin{bmatrix} -\pi_2 - \pi_3 \\ \pi_1 + \pi_3 \\ \pi_1 - \pi_2 \end{bmatrix}. \tag{3.5} \]

It is clear that $T\gamma = 0$. If $\Pi$ is chosen such that $-\pi_2 - \pi_3 \ne 0$ or $\pi_1 + \pi_3 \ne 0$ or $\pi_1 - \pi_2 \ne 0$, then $\mathrm{rank}(T) = n - 1 = 2$. □

² The unique conflict vector (a vector in the nullspace of $T$) can be obtained by other methods such as Gaussian elimination.
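The conflict vector of a mapping matrix $T \in Z^{(n-1) \times n}$ can also be computed numerically from the signed maximal minors of $T$, which span the same one-dimensional null space as (3.2). The following is a minimal sketch under that assumption; the helper names and the numeric schedule `Pi = [1, 2, 3]` are illustrative and do not appear in the paper. The result is normalised as in Theorem 3.1 (relatively prime entries, first nonzero entry positive).

```python
from functools import reduce
from math import gcd

def det(M):
    """Integer determinant by Laplace expansion (adequate for small matrices)."""
    if not M:
        return 1
    return sum((-1) ** c * a * det([row[:c] + row[c + 1:] for row in M[1:]])
               for c, a in enumerate(M[0]) if a)

def conflict_vector(T):
    """Normalised conflict vector of an (n-1) x n matrix, or None if rank(T) < n-1."""
    n = len(T[0])
    # Entry i is the signed determinant of T with column i removed.
    gamma = [(-1) ** i * det([row[:i] + row[i + 1:] for row in T]) for i in range(n)]
    g = reduce(gcd, gamma)
    if g == 0:
        return None                      # all maximal minors vanish
    gamma = [x // g for x in gamma]
    return gamma if next(x for x in gamma if x) > 0 else [-x for x in gamma]

S = [1, 1, -1]          # space mapping of Example 3.1
Pi = [1, 2, 3]          # hypothetical schedule vector
gamma = conflict_vector([S, Pi])
print(gamma)
```

For this choice the entries agree, up to the sign fixed by the normalisation, with $[-\pi_2 - \pi_3,\ \pi_1 + \pi_3,\ \pi_1 - \pi_2]^T$ evaluated at `Pi`.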
Example 3.2: Consider another algorithm $(J, D)$ used in [22] where dependence matrix $D$ and index set $J$ are as follows.

\[ D = \begin{bmatrix} 0 & 0 & 1 & 1 & 1 \\ 0 & 1 & -1 & -1 & 0 \\ 1 & 0 & -1 & 0 & -1 \end{bmatrix}, \qquad J = \{ j : j \in Z^3,\ 0 \le j_i \le \mu,\ i = 1, \ldots, 3 \}. \tag{3.6} \]

This uniform dependence algorithm can model the reindexed transitive closure algorithm. How the dependence matrix and the index set are derived from the Fortran code of the transitive closure algorithm is shown in [17] and [23]. If the space mapping matrix is chosen as the one used in [22], $S = [0, 0, 1]$, then mapping matrix $T$ and its conflict vector $\gamma$ are as follows.


(3.7) 


It is clear that $T\gamma = 0$. If $\Pi$ is chosen such that $\pi_2 \ne 0$ or $\pi_3 \ne 0$ then $\mathrm{rank}(T) = n - 1 = 2$. □


IV.  General  Case— Necessary  and  Sufficient 
Conditions  for  Conflict-Free  Mappings 


This section discusses and presents the solution to the general case of Problem 2.1, i.e., it provides necessary and sufficient conditions for conflict-free mappings where $n$-dimensional algorithms are mapped into $(k-1)$-dimensional processor arrays. In these mappings, $T \in Z^{k \times n}$, $T = \begin{bmatrix} S \\ \Pi \end{bmatrix}$, $\Pi \in Z^{1 \times n}$ and $S \in Z^{(k-1) \times n}$.

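Consider the equation

\[ T\gamma = 0 \quad \text{or} \quad \begin{bmatrix} S \\ \Pi \end{bmatrix} \gamma = 0. \tag{4.1} \]

If $\mathrm{rank}(T) = k$, then there are $n - k$ linearly independent solutions of (4.1). Let $\gamma_1, \ldots, \gamma_{n-k}$ be the linearly independent integral solutions of (4.1), whose entries are relatively prime, then all solutions $\gamma$ of (4.1) can be represented as linear combinations of the $n - k$ linearly independent vectors as follows

\[ \gamma = \lambda_1\gamma_1 + \cdots + \lambda_{n-k}\gamma_{n-k}. \tag{4.2} \]

Clearly, $\gamma_1, \ldots, \gamma_{n-k}$ are conflict vectors of $T$.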

In general, the mapping matrix $T$ has more than $n - k$ conflict vectors when $k < n - 1$ because a linear combination of these $n - k$ conflict vectors may be a different integral vector whose entries are relatively prime and therefore another conflict vector of $T$. This new conflict vector may or may not be feasible. Thus, unlike the mapping matrix $T \in Z^{(n-1) \times n}$




described in Section III, it is not guaranteed that all conflict vectors of $T$ are feasible even if the $n - k$ linearly independent solutions $\gamma_i$, $i = 1, \ldots, n - k$, of equation $T\gamma = 0$ are all feasible. This is illustrated by the following example.

Example 4.1: Consider the four-dimensional algorithm of Example 2.1 and the mapping matrix $T$ in (2.8). Let $\gamma_1 = [0, 1, -7, 0]^T$ and $\gamma_2 = [7, -1, 0, 0]^T$. Clearly, $T\gamma_1 = T\gamma_2 = 0$, $\gamma_1$ and $\gamma_2$ are linearly independent, and they are feasible conflict vectors of $T$. Let $\gamma = \frac{1}{7}\gamma_1 + \frac{1}{7}\gamma_2 = [1, 0, -1, 0]^T$. Vector $\gamma$ is also a solution of equation $T\gamma = 0$ and its entries are relatively prime. By Definition 2.3 and Theorem 2.2, $\gamma$ is a nonfeasible conflict vector of $T$. Therefore, as mentioned above, for a given mapping matrix $T \in Z^{k \times n}$ with $k < n - 1$, there are possibly more than $n - k$ conflict vectors, and $T$ may not be conflict-free even if there are $n - k$ linearly independent feasible conflict vectors of $T$. □

From Example 4.1, an interesting observation is that one difficulty in making all conflict vectors of mapping matrix $T$ feasible is that nonfeasible conflict vectors can result from rational linear combinations of the $n - k$ linearly independent feasible conflict vectors $\gamma_1, \ldots, \gamma_{n-k}$ like $\gamma = \frac{1}{7}\gamma_1 + \frac{1}{7}\gamma_2$ in Example 4.1. Let us consider another way to select the $n - k$ linearly independent conflict vectors of $T$ such that constants $\lambda_i$, $i = 1, \ldots, n - k$ in (4.2) must be integral in order for $\gamma$ to be integral. To achieve this, the notion of the Hermite normal form is introduced.

Theorem 4.1 (Hermite normal form [29, p. 45]): Let $T \in Z^{k \times n}$ and $\mathrm{rank}(T) = k$. Then there exists a unimodular³ matrix $U \in Z^{n \times n}$ such that $TU = H = [L, 0]$ ($0$ denotes a zero-entry matrix) where $L \in Z^{k \times k}$ is a nonsingular and lower triangular matrix. Matrix $H$ is called the Hermite normal form of $T$.

The  definition  of  the  Hermite  normal  form  used  here  is 
slightly  different  from  the  one  used  conventionally  and  in  [29], 
where  each  diagonal  element  of  matrix  L  is  required  to  be 
positive  and  be  the  maximum  of  all  absolute  values  of  elements 
in  that  same  row.  This  is  because  for  the  purpose  of  this  paper 
it  is  enough  to  know  that  matrix  T  can  be  transformed  into 
a  lower  triangular  matrix  [L,  0],  by  right  multiplication  of  a 
unimodular  matrix  U. 
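Theorem 4.1 asserts that $U$ exists; one elementary way to construct it is Euclidean reduction on columns. The sketch below is an illustration under that assumption and is not claimed to be the method of [29]. It uses only column swaps and additions of integer multiples of one column to another, so the accumulated $U$ is unimodular, and it leaves the diagonal of $L$ unnormalised, in line with the relaxed definition used here. The function name and the matrix `T` are hypothetical; `T` is an illustrative $2 \times 4$ matrix and is not asserted to be the matrix of (2.8).

```python
def hermite_lower(T):
    """Return (H, U) with T*U = H = [L, 0], L lower triangular, |det U| = 1."""
    k, n = len(T), len(T[0])
    H = [row[:] for row in T]
    U = [[int(i == j) for j in range(n)] for i in range(n)]

    def swap(a, b):                      # exchange columns a and b in H and U
        for M in (H, U):
            for row in M:
                row[a], row[b] = row[b], row[a]

    def subtract(a, b, q):               # column a -= q * column b in H and U
        for M in (H, U):
            for row in M:
                row[a] -= q * row[b]

    for r in range(k):
        while True:
            nonzero = [c for c in range(r, n) if H[r][c] != 0]
            if not nonzero:
                raise ValueError("rank(T) < k")
            # Move the entry of smallest magnitude to the diagonal position.
            swap(r, min(nonzero, key=lambda c: abs(H[r][c])))
            for c in range(r + 1, n):
                subtract(c, r, H[r][c] // H[r][r])
            if all(H[r][c] == 0 for c in range(r + 1, n)):
                break                    # row r is now [*, ..., *, pivot, 0, ..., 0]
    return H, U

T = [[1, 7, 1, 1], [1, 7, 1, 0]]         # hypothetical mapping matrix
H, U = hermite_lower(T)
print(H)
print(U)
```

The last $n - k$ columns of the returned `U` are integral solutions of $T\gamma = 0$. Each pass replaces the remaining entries of row `r` by remainders smaller in magnitude than the pivot, so the loop terminates.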

For a given mapping matrix $T$, let $H$ be the corresponding Hermite normal form and $T = HV$ where $V = U^{-1}$, $U = [u_1, \ldots, u_n]$ and $V = [v_1, \ldots, v_n]$. Then (4.1) can be rewritten as $HV\gamma = 0$. Let $\beta = V\gamma = [\beta_1, \ldots, \beta_n]^T$ and $\gamma = U\beta$. Then the following statements are true.

Theorem 4.2:

1) $H\beta = 0$ if and only if $\beta_1, \ldots, \beta_k$ are zero.

2) Vector $\gamma$ is integral if and only if $\beta$ is integral.

3) Vector $\gamma$ is a conflict vector of mapping matrix $T$ if and only if

\[ \gamma = [u_{k+1}, \ldots, u_n] \begin{bmatrix} \beta_{k+1} \\ \vdots \\ \beta_n \end{bmatrix} \tag{4.3} \]

where $\beta_i$, $i = k + 1, \ldots, n$, are arbitrary integers which are not all zero and are relatively prime.

3  A  matrix  is  unimodular  if  and  only  if  it  is  integral  and  the  absolute  value 
of  its  determinant  is  one. 


Proof: 1) Because $H = [L, 0]$ and $H\beta = L[\beta_1, \ldots, \beta_k]^T$ where $L \in Z^{k \times k}$ is a nonsingular lower triangular matrix, $\beta_1, \ldots, \beta_k$ are zero if and only if $H\beta = 0$.

2) By definition, a matrix is unimodular if and only if it is integral and the absolute value of its determinant is unity. So $U$ is unimodular means that matrix $V = U^{-1}$ is also unimodular. Therefore, $\gamma$ is integral implies that $\beta$ is integral and vice versa.

3) By Theorem 4.2 1) and 2), all integral solutions $\gamma$ of equation $T\gamma = 0$ are represented by (4.3) where $\beta_i$, $i = k + 1, \ldots, n$, have to be arbitrary integers because nonintegral values of $\beta_i$, $i = k + 1, \ldots, n$ result in a nonintegral vector $\gamma$. Next it is shown that the greatest common divisor of $\beta_i$, $i = 1, \ldots, n$ is unity if and only if the greatest common divisor of $\gamma_i$, $i = 1, \ldots, n$ is unity. Suppose $\gcd(\beta_1, \ldots, \beta_n) = 1$ and $\gcd(\gamma_1, \ldots, \gamma_n) = c > 1$. Then $\gamma = c\gamma'$ where $\gamma'$ is integral and its entries are relatively prime. Because $\beta = V\gamma = cV\gamma'$ where, obviously, $V\gamma' \in Z^n$, the greatest common divisor of $\beta_i$, $i = 1, \ldots, n$ is at least $c > 1$. This is contrary to the assumption. So, the greatest common divisor of $\beta_i$, $i = 1, \ldots, n$ is unity implies the greatest common divisor of $\gamma_i$, $i = 1, \ldots, n$ is unity. With similar reasoning, the reverse can be shown. Therefore, the $\beta_i$, $i = k + 1, \ldots, n$, in (4.3) have to be relatively prime integers, otherwise the greatest common divisor of entries of vector $\gamma$ is greater than one, which is not a conflict vector by Definition 2.3. □

What Theorem 4.2 implies is that all conflict vectors of mapping matrix $T$ can be represented by (4.3) where $\beta_{k+1}, \ldots, \beta_n$ are arbitrary integers which are not all zero and are relatively prime. Notice that a nonintegral value of any one of the $\beta_{k+1}, \ldots, \beta_n$ results in a nonintegral vector $\gamma$ according to Theorem 4.2. So in this representation, the case where a new conflict vector of $T$ can be obtained by a nonintegral linear combination of the $n - k$ linearly independent solutions of (4.1) is avoided.

Example  4.2:  The  Hermite  normal  form  of  the  mapping 
matrix  T  in  (2.8)  is 


where 


TU  =  H  = 


0  0  0 
-10  0 


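\[ U = \begin{bmatrix} 1 & -1 & -1 & -7 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix} \quad \text{and} \quad V = U^{-1} = \begin{bmatrix} 1 & 7 & 1 & 1 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \end{bmatrix}. \]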

All conflict vectors of $T$ are the integral combinations of the third and fourth columns of matrix $U$ as follows:

\[ \gamma = \begin{bmatrix} -1 & -7 \\ 0 & 1 \\ 1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} \beta_3 \\ \beta_4 \end{bmatrix} \]

where $\beta_3$ and $\beta_4$ are integers which are not all zero and are relatively prime. □
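The representation (4.3) can be checked numerically on this example. The sketch below takes $U$ and $V$ as read from Example 4.2 (an assumption of the sketch, since the printed matrices are hard to read in this copy) and verifies that the rational combination of Example 4.1 corresponds to an integral $\beta$ whose first $k = 2$ entries vanish. The helper names are illustrative.

```python
from fractions import Fraction

U = [[1, -1, -1, -7],
     [0,  0,  0,  1],
     [0,  0,  1,  0],
     [0,  1,  0,  0]]
V = [[1, 7, 1, 1],
     [0, 0, 0, 1],
     [0, 0, 1, 0],
     [0, 1, 0, 0]]
k = 2

def matvec(M, x):
    """Matrix-vector product with exact arithmetic."""
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def conflict_vector_from(beta_tail):
    """gamma = [u_{k+1}, ..., u_n] beta, as in (4.3)."""
    return [sum(row[k + i] * b for i, b in enumerate(beta_tail)) for row in U]

# gamma = (1/7) gamma_1 + (1/7) gamma_2 from Example 4.1
gamma = [Fraction(a + b, 7) for a, b in zip([0, 1, -7, 0], [7, -1, 0, 0])]
beta = matvec(V, gamma)
print(gamma, beta)
print(conflict_vector_from(beta[k:]))
```

In the basis $u_3, u_4$ the nonfeasible vector of Example 4.1 has integral coordinates, whereas in the basis $\gamma_1, \gamma_2$ it needs the rational coefficients $1/7$.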

So far, a better representation of all conflict vectors of $T$ has been found which requires integral combinations of $n - k$ linearly independent conflict vectors of mapping matrix $T$. However, it is still not guaranteed that all conflict vectors are feasible unless matrix $U$ satisfies some conditions. The




following  five  theorems  describe  these  necessary  and  sufficient 
conditions  for  mapping  matrix  T  to  be  conflict-free. 

Theorem 4.3 (Necessary condition 2): Let $v_{ij}$ be the entry of matrix $V$ at the $i$th row and the $j$th column. If mapping matrix $T$ is conflict-free, then at least one of the first $k$ entries of each and every column of $V$ must be nonzero, that is, the following condition holds.

\[ (v_{11} \ne 0 \lor v_{21} \ne 0 \lor \cdots \lor v_{k1} \ne 0) \land (v_{12} \ne 0 \lor v_{22} \ne 0 \lor \cdots \lor v_{k2} \ne 0) \land \cdots \land (v_{1n} \ne 0 \lor v_{2n} \ne 0 \lor \cdots \lor v_{kn} \ne 0). \tag{4.4} \]

Proof: Let $\gamma$ be an arbitrary conflict vector of mapping matrix $T$. If $T$ is conflict-free, then $\gamma$ is feasible. Feasible conflict vector $\gamma$ has at least two nonzero entries $\gamma_i \ne 0$ and $\gamma_j \ne 0$, $i \ne j$. Otherwise, $\gamma$ has only one nonzero entry which is unity because the greatest common divisor of entries of $\gamma$ is required to be one. Such a conflict vector is not feasible. Next, it is shown that $\gamma$ has at least two nonzero entries if and only if (4.4) holds.

($\Leftarrow$). Suppose that $\gamma$ has only one nonzero entry, then $\gamma = [0, \ldots, 0, 1, 0, \ldots, 0]^T$. By assumption, $\beta = V\gamma = v_i$, the $i$th column of matrix $V$. According to (4.4), there exists a nonzero element $v_{li} \in \{v_{1i}, \ldots, v_{ki}\}$ which means $\beta_l \ne 0$, $1 \le l \le k$. This is contrary to Theorem 4.2 1) that $\beta_i = 0$, $i = 1, \ldots, k$. Therefore, $\gamma$ has at least two nonzero entries.

($\Rightarrow$). Suppose there exists a column $v_i$ of matrix $V$ whose first $k$ entries are all zero, then $v_i = [0, \ldots, 0, v_{k+1,i}, \ldots, v_{ni}]^T$. Let $\beta = v_i$. Because the first $k$ entries of $v_i$ are zero, $H\beta = 0$ and $\gamma = V^{-1}\beta$ is a conflict vector of mapping matrix $T$. However, $\gamma = V^{-1}v_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T$ whose entries are all zero except that the $i$th entry is unity. So, mapping matrix $T$ has a conflict vector with only one nonzero entry which is contrary to the assumption. This means (4.4) holds. □
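Condition (4.4) is a column-by-column test on $V$ and can be sketched in a few lines. The function name is illustrative. The first matrix below is $V$ as read from Example 4.2; the second is the identity, for which $T = H = [L, 0]$ and the unit vector $e_3$ is a conflict vector with a single nonzero entry.

```python
def satisfies_condition_4_4(V, k):
    """Necessary condition (4.4): every column of V has a nonzero among its first k entries."""
    n = len(V[0])
    return all(any(V[i][j] != 0 for i in range(k)) for j in range(n))

V_example = [[1, 7, 1, 1],
             [0, 0, 0, 1],
             [0, 0, 1, 0],
             [0, 1, 0, 0]]
V_identity = [[int(i == j) for j in range(4)] for i in range(4)]

print(satisfies_condition_4_4(V_example, 2))   # condition holds
print(satisfies_condition_4_4(V_identity, 2))  # columns 3 and 4 violate it
```

Since the condition is only necessary, passing this test does not show that a mapping is conflict-free.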

Theorem 4.4 (Necessary condition 3): If mapping matrix $T$ is feasible, then $u_{k+1}, \ldots, u_n$ are feasible conflict vectors.

Proof: By definitions, $Tu_i = 0$, $i = k + 1, \ldots, n$. Thus, $u_{k+1}, \ldots, u_n$ are conflict vectors of $T$. $T$ is feasible implies $u_i$, $i = k + 1, \ldots, n$, must be feasible. □

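Theorem 4.5 (Sufficient condition 4): Mapping matrix $T$ is conflict-free if the following conditions are met.

There exist $i_1, \ldots, i_{n-k} \in \{1, \ldots, n\}$ such that

1)
\[ \begin{aligned} \gcd(u_{i_1,k+1}, u_{i_1,k+2}, \ldots, u_{i_1,n}) &\ge \mu_{i_1} + 1, \\ \gcd(u_{i_2,k+1}, u_{i_2,k+2}, \ldots, u_{i_2,n}) &\ge \mu_{i_2} + 1, \\ &\ \ \vdots \\ \gcd(u_{i_{n-k},k+1}, u_{i_{n-k},k+2}, \ldots, u_{i_{n-k},n}) &\ge \mu_{i_{n-k}} + 1. \end{aligned} \]

2)
\[ \det \begin{bmatrix} u_{i_1,k+1} & u_{i_1,k+2} & \cdots & u_{i_1,n} \\ u_{i_2,k+1} & u_{i_2,k+2} & \cdots & u_{i_2,n} \\ \vdots & \vdots & & \vdots \\ u_{i_{n-k},k+1} & u_{i_{n-k},k+2} & \cdots & u_{i_{n-k},n} \end{bmatrix} \ne 0. \]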

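Proof: Let $\gamma$ be an arbitrary conflict vector of mapping matrix $T$ represented by (4.3) where $\beta_{k+1}, \ldots, \beta_n$ are arbitrary integers which are not all zero and are relatively prime. Because

\[ \det \begin{bmatrix} u_{i_1,k+1} & u_{i_1,k+2} & \cdots & u_{i_1,n} \\ \vdots & \vdots & & \vdots \\ u_{i_{n-k},k+1} & u_{i_{n-k},k+2} & \cdots & u_{i_{n-k},n} \end{bmatrix} \ne 0 \]

and $\beta_{k+1}, \ldots, \beta_n$ are not all zero, there exists $i \in \{i_1, \ldots, i_{n-k}\}$ such that

\[ [u_{i,k+1}, u_{i,k+2}, \ldots, u_{i,n}] \begin{bmatrix} \beta_{k+1} \\ \beta_{k+2} \\ \vdots \\ \beta_n \end{bmatrix} = \gamma_i \ne 0. \tag{4.5} \]

According to condition 1), $\gcd(u_{i,k+1}, \ldots, u_{i,n}) = \alpha_i \ge \mu_i + 1$. If $|\gamma_i| < \mu_i + 1$, then $\alpha_i$ does not divide $\gamma_i$ which means, according to [27], (4.5) has no integral solution and $\beta_{k+1}, \ldots, \beta_n$ are not all integral. Thus, $\gamma$ is not integral and not a conflict vector. Therefore, it must be $|\gamma_i| \ge \mu_i + 1$. By Definition 2.3 and Theorem 2.2, $\gamma$ is feasible and $T$ is conflict-free. □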
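The two conditions of Theorem 4.5 can be tested mechanically once $U$ is known. The following is a minimal sketch, assuming the bounds $\mu_i$ are given as a list; the function names, the bounds, and the matrix `U_good` are hypothetical. `U_example` is the matrix $U$ as read from Example 4.2. Because the theorem is only sufficient, a `False` result is inconclusive.

```python
from functools import reduce
from itertools import combinations
from math import gcd

def det(M):
    """Integer determinant by Laplace expansion (adequate for small matrices)."""
    if not M:
        return 1
    return sum((-1) ** c * a * det([row[:c] + row[c + 1:] for row in M[1:]])
               for c, a in enumerate(M[0]) if a)

def theorem_4_5_holds(U, k, mu):
    """True if some n-k rows of the last n-k columns of U meet conditions 1) and 2)."""
    n = len(U)
    tail = [row[k:] for row in U]
    # Condition 1): rows whose gcd is at least mu_i + 1.
    good = [i for i in range(n) if reduce(gcd, tail[i]) >= mu[i] + 1]
    # Condition 2): some choice of n-k such rows has a nonzero determinant.
    return any(det([tail[i] for i in rows]) != 0
               for rows in combinations(good, n - k))

mu = [6, 6, 6, 6]                                   # hypothetical bounds
U_example = [[1, -1, -1, -7], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
U_good = [[1, 0, 7, 0], [0, 1, 0, 7], [0, 0, 1, 0], [0, 0, 0, 1]]  # hypothetical
print(theorem_4_5_holds(U_example, 2, mu), theorem_4_5_holds(U_good, 2, mu))
```

For `U_good` every conflict vector has the form $[7\beta_3, 7\beta_4, \beta_3, \beta_4]^T$, so one of its first two entries has magnitude at least 7, which exceeds the assumed bound of 6.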

Theorem 4.6 (Sufficient condition 5 for $T \in Z^{(n-2) \times n}$): Mapping matrix $T \in Z^{(n-2) \times n}$ is conflict-free if the following conditions are met.

1) There exists $i \in \{1, \ldots, n\}$ such that $\gcd(u_{i,n-1}, u_{i,n}) \ge \mu_i + 1$.

2) Let $\beta_{n-1}$ and $\beta_n$ be relatively prime integers, not both zero and such that $\beta_{n-1}u_{i,n-1} + \beta_n u_{i,n} = 0$. There exists $j \in \{1, \ldots, n\}$, $j \ne i$, such that $|\beta_{n-1}u_{j,n-1} + \beta_n u_{j,n}| > \mu_j$.

Proof: Consider all integral values for $\beta_{n-1}$ and $\beta_n$ which are not both zero, are relatively prime, and $\beta_{n-1}u_{i,n-1} + \beta_n u_{i,n} \ne 0$. Let the corresponding conflict vectors be $\gamma$. Because $\gamma_i \ne 0$ and $\gcd(u_{i,n-1}, u_{i,n}) = \alpha_i \ge \mu_i + 1$, $|\gamma_i| \ge \mu_i + 1$. Otherwise, $\alpha_i$ does not divide $\gamma_i$ and equation $\beta_{n-1}u_{i,n-1} + \beta_n u_{i,n} = \gamma_i$ has no integral solution [27]. Therefore, for these values of $\beta_{n-1}$ and $\beta_n$, the corresponding conflict vectors are feasible. Now let us consider the integral values for $\beta_{n-1}$ and $\beta_n$ which are not both zero, are relatively prime and $\beta_{n-1}u_{i,n-1} + \beta_n u_{i,n} = 0$. For the corresponding conflict vector $\gamma$, because there exists $j \in \{1, \ldots, n\}$, $j \ne i$, such that $|\beta_{n-1}u_{j,n-1} + \beta_n u_{j,n}| > \mu_j$, $|\gamma_j|$ is greater than $\mu_j$ and $\gamma$ is feasible. Therefore, mapping matrix $T$ is conflict-free because all of its conflict vectors are feasible. □

Theorem 4.7 (Sufficient condition 6 for $T \in Z^{(n-2) \times n}$): Mapping matrix $T \in Z^{(n-2) \times n}$ is conflict-free if the following conditions are met.

1) There exists $i \in \{1, \ldots, n\}$ such that $u_{i,n-1} \cdot u_{i,n} > 0$ and $|u_{i,n-1} + u_{i,n}| > \mu_i$.

2) There exists $j \in \{1, \ldots, n\}$ such that $u_{j,n-1} \cdot u_{j,n} < 0$ and $|u_{j,n-1} - u_{j,n}| > \mu_j$.

3) $u_{n-1}$ and $u_n$ are feasible conflict vectors.

Proof: Because $T \in Z^{(n-2) \times n}$, its conflict vectors are described by (4.3) where $k = n - 2$, and $\beta_{n-1}$ and $\beta_n$ are arbitrary integers that are not both zero and are relatively prime. Suppose conditions 1), 2), and 3) are met. Let us




consider the case where one of $\beta_{n-1}$ and $\beta_n$ is zero. Then the corresponding conflict vector $\gamma$ equals either $u_{n-1}$ or $u_n$ and is feasible in either case because of condition 3). Let us consider the second case where $\beta_{n-1} \ne 0$ and $\beta_n \ne 0$ have the same sign and the corresponding conflict vector $\gamma$. By condition 1), there exists $i \in \{1, \ldots, n\}$ such that $u_{i,n-1}$, $u_{i,n}$ have the same sign and $|u_{i,n-1} + u_{i,n}| > \mu_i$. Because $u_{i,n-1}$ has the same sign as $u_{i,n}$ and $\beta_{n-1}$ has the same sign as $\beta_n$, $|\gamma_i| = |\beta_{n-1}u_{i,n-1} + \beta_n u_{i,n}| = |\beta_{n-1}u_{i,n-1}| + |\beta_n u_{i,n}| \ge |u_{i,n-1} + u_{i,n}| > \mu_i$. Therefore, $\gamma$ is feasible. Now let us consider the third case where $\beta_{n-1} \ne 0$ and $\beta_n \ne 0$ have opposite signs and the corresponding conflict vector $\gamma$. According to condition 2), there exists $j \in \{1, \ldots, n\}$ such that $u_{j,n-1}$, $u_{j,n}$ have opposite signs and $|u_{j,n-1} - u_{j,n}| > \mu_j$. Because $\beta_{n-1}$ and $\beta_n$ have different signs, $|\gamma_j| = |\beta_{n-1}u_{j,n-1} + \beta_n u_{j,n}| \ge |u_{j,n-1} - u_{j,n}| > \mu_j$. Therefore, $\gamma$ is feasible. □
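The three conditions of Theorem 4.7 involve only the last two columns of $U$ and translate directly into code. The sketch below assumes the bounds $\mu_i$ are supplied as a list; the function names, the bounds, and the matrix `U_good` are hypothetical, and `U_example` is $U$ as read from Example 4.2.

```python
def feasible(gamma, mu):
    """Theorem 2.2 test: some |gamma_i| exceeds mu_i."""
    return any(abs(g) > m for g, m in zip(gamma, mu))

def theorem_4_7_holds(U, mu):
    """Check conditions 1), 2), 3) of Theorem 4.7 on the last two columns of U."""
    n = len(U)
    a = [row[n - 2] for row in U]       # column u_{n-1}
    b = [row[n - 1] for row in U]       # column u_n
    cond1 = any(a[i] * b[i] > 0 and abs(a[i] + b[i]) > mu[i] for i in range(n))
    cond2 = any(a[j] * b[j] < 0 and abs(a[j] - b[j]) > mu[j] for j in range(n))
    cond3 = feasible(a, mu) and feasible(b, mu)
    return cond1 and cond2 and cond3

mu = [6, 6, 6, 6]                        # hypothetical bounds
U_example = [[1, -1, -1, -7], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
U_good = [[1, 0, 7, 4], [0, 1, 4, -7], [0, 0, 1, 0], [0, 0, 0, 1]]  # hypothetical
print(theorem_4_7_holds(U_example, mu), theorem_4_7_holds(U_good, mu))
```

With the assumed bounds, `U_example` fails conditions 2) and 3), so the theorem gives no conclusion for it, which is consistent with the nonfeasible conflict vector exhibited in Example 4.1.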

Finding a closed form, necessary and sufficient condition on conflict-free mappings for the general case is still open. The following discussion on the mapping matrices $T \in Z^{(n-2) \times n}$ might provide some insight into why this problem is difficult. Let $U'_i$ denote the $i$th, $1 \le i \le n$, row of the $n \times 2$ matrix which has the last two columns of $U$ as its columns. Let $C = \{[\beta_{n-1}, \beta_n]^T : |U'_i[\beta_{n-1}, \beta_n]^T| \le \mu_i,\ i = 1, \ldots, n\}$. If $C$ contains integral elements whose two entries are relatively prime and not all zero, then the corresponding mapping has computational conflicts. $C$ is a convex polyhedron. Thus, to find closed form necessary and sufficient conditions on conflict-free mappings can be reduced to the closed form conditions to test if a convex polyhedron contains integral elements, which is a difficult open problem in the integer programming area. One possible approach is to construct a minimal hypercube which contains the convex polyhedron $C$. Because the upper and lower bounds of the hypercube in each dimension are constant, it is easier to test if this hypercube contains integral elements which are in $C$. Based on this idea, instead of the closed form necessary and sufficient conditions, a procedure is developed to test if a mapping matrix is conflict-free or not. This procedure will be reported in a separate paper [37].
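The hypercube idea can be illustrated by exhaustive search. The sketch below is not the procedure of [37]; it is a brute-force stand-in under stated assumptions. It uses the fact that $\beta = V\gamma$ and $|\gamma_j| \le \mu_j$ on $C$, so $|\beta_i| \le \sum_j |v_{ij}|\mu_j$, which yields a (not necessarily minimal) cube containing $C$. All names, the bounds `mu`, and the pair `U_good`, `V_good` are hypothetical; `U_example` and `V_example` are the matrices as read from Example 4.2.

```python
from functools import reduce
from itertools import product
from math import gcd

def find_conflict(U, V, k, mu):
    """Return a nonfeasible conflict vector of T if one exists, else None."""
    n = len(U)
    tail = [row[k:] for row in U]                       # columns u_{k+1}, ..., u_n
    # beta = V gamma and |gamma_j| <= mu_j bound every beta_i on C.
    bound = max(sum(abs(v) * m for v, m in zip(V[i], mu)) for i in range(k, n))
    for beta in product(range(-bound, bound + 1), repeat=n - k):
        if reduce(gcd, beta) != 1:
            continue                                    # skip zero and non-coprime beta
        gamma = [sum(u * b for u, b in zip(row, beta)) for row in tail]
        if all(abs(g) <= m for g, m in zip(gamma, mu)):
            return gamma                                # integral point of C found
    return None

mu = [6, 6, 6, 6]                                       # hypothetical bounds
U_example = [[1, -1, -1, -7], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
V_example = [[1, 7, 1, 1], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]]
U_good = [[1, 0, 7, 0], [0, 1, 0, 7], [0, 0, 1, 0], [0, 0, 0, 1]]
V_good = [[1, 0, -7, 0], [0, 1, 0, -7], [0, 0, 1, 0], [0, 0, 0, 1]]
print(find_conflict(U_example, V_example, 2, mu))
print(find_conflict(U_good, V_good, 2, mu))
```

The search cost grows as $(2 \cdot \text{bound} + 1)^{n-k}$, so this is practical only for small problem sizes.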

Multiplier matrix $U$ is involved in most of the necessary and sufficient conditions for conflict-free mappings and the key point is to have matrix $U$ satisfy certain conditions in order to make the mapping matrix $T$ conflict-free. Given a mapping matrix $T$ in numbers, there are several numerical methods to compute matrix $U$ in polynomial time [20]. When the entries of $T$ are symbols (variables), computing matrix $U$ becomes more difficult. However, when the space mapping matrix $S$ is known, it is possible to express matrix $U$ as a function of $\Pi$. In [30], matrix $U$ is expressed as a function of $\Pi$ for the mapping matrix $T \in Z^{3 \times 5}$.

V. Conflict-Free Mappings

Solutions to Problem 2.1 have been discussed in the above sections. In this section, Problem 2.2 and its solutions are discussed: that is, how to find the optimal vector $\Pi^{\circ}$ which contributes a conflict-free mapping matrix $T$ and satisfies some other constraints. A procedure is presented which searches the solution space intelligently by employing the basic strategy in [10] and [11]. The integer linear programming formulation plus some heuristic is used to find optimal time mappings for the matrix multiplication algorithm and the reindexed transitive closure algorithm.

By Theorem 2.1, the total execution time increases monotonically with the sum of the weighted absolute values of entries of vector $\Pi$. Therefore, to find the solution of Problem 2.2, one simple way is to modify the method in [10] and [11] where candidates are enumerated in an increasing order of their total execution times. In the modified method, besides other considerations, each candidate should be checked to see if it contributes a conflict-free mapping. This method is described in the following procedure.

Procedure 5.1 (Finding optimal solution $\Pi^{\circ}$ of Problem 2.2):

Input: Algorithm $(J, D)$ and the space mapping $S \in Z^{(k-1) \times n}$.

Output: An integral row vector $\Pi^{\circ}$ which is the solution of Problem 2.2.

Step 1: Set $l = 1$, $C_0 = \emptyset$.

Step 2: Construct a candidate set $C_l = \{\Pi : \sum_{i=1}^{n} |\pi_i|\mu_i \le x_l,\ x_l \in N^+\}$. $C = C_l - C_l \cap C_{l-1}$. Let $|C| = c$.

Step 3: Sort and assign indices to all candidates in $C$ such that $t(\Pi_i) \le t(\Pi_j)$, $0 < i < j \le c$, where $t(\Pi)$ is the function described in (2.7).

Step 4: Set $i = 1$.

Step 5: Consider candidate $\Pi_i$; set $T^i = \begin{bmatrix} S \\ \Pi_i \end{bmatrix}$ and check if $\Pi_i$ meets the following conditions

a) $\Pi_i D > 0$.

b) $\mathrm{rank}(T^i) = k$.

c) $T^i$ is conflict-free.

d) $SD = PK$ and $\sum_j k_{jr} \le \Pi_i d_r$, $r = 1, \ldots, m$, where $P$ and $K$ are as defined in Definition 2.2.

Step 6: If $\Pi_i$ meets all the conditions above, then $\Pi^{\circ} = \Pi_i$ and stop. If $\Pi_i$ fails any one of the conditions above, ignore this candidate and set $i = i + 1$.

Step 7: If $i > c$, set $l = l + 1$, $x_l = x_{l-1} + \alpha$, $\alpha \in N^+$, and go to Step 2. If $i \le c$, go to Step 5.
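The enumeration in Procedure 5.1 can be sketched for the case $T \in Z^{(n-1) \times n}$, where Theorem 3.1 gives an exact conflict test. The following is a simplified illustration under several assumptions that are not in the paper: candidates are ordered by $\sum_i |\pi_i|\mu_i$ instead of $t(\Pi)$ from (2.7), the budget grows in steps of $\alpha = 1$, condition d) on the interconnection matrices $P$ and $K$ is omitted, and the bound $\mu = 3$ is made up. The function names are hypothetical.

```python
from functools import reduce
from itertools import product
from math import gcd

def det(M):
    """Integer determinant by Laplace expansion (adequate for small matrices)."""
    if not M:
        return 1
    return sum((-1) ** c * a * det([row[:c] + row[c + 1:] for row in M[1:]])
               for c, a in enumerate(M[0]) if a)

def conflict_vector(T):
    """Normalised conflict vector of an (n-1) x n matrix, or None if rank(T) < n-1."""
    n = len(T[0])
    gamma = [(-1) ** i * det([row[:i] + row[i + 1:] for row in T]) for i in range(n)]
    g = reduce(gcd, gamma)
    if g == 0:
        return None
    gamma = [x // g for x in gamma]
    return gamma if next(x for x in gamma if x) > 0 else [-x for x in gamma]

def find_schedule(S, D, mu, max_budget=200):
    """Return (Pi, gamma) for the first conflict-free candidate, or None."""
    dependences = list(zip(*D))                       # columns d_1, ..., d_m of D
    for budget in range(1, max_budget + 1):           # x_l = l, i.e. alpha = 1
        ranges = [range(-(budget // m), budget // m + 1) for m in mu]
        for pi in product(*ranges):
            if sum(abs(p) * m for p, m in zip(pi, mu)) != budget:
                continue                              # keep only C_l minus C_{l-1}
            if any(sum(p * d for p, d in zip(pi, dep)) <= 0 for dep in dependences):
                continue                              # a) Pi D > 0
            gamma = conflict_vector([list(r) for r in S] + [list(pi)])
            if gamma is None:
                continue                              # b) rank(T) = n - 1
            if any(abs(g) > m for g, m in zip(gamma, mu)):
                return pi, gamma                      # c) Theorems 3.1 and 2.2
    return None

# Matrix multiplication (Example 3.1) with a hypothetical bound mu = 3.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
result = find_schedule([[1, 1, -1]], identity, [3, 3, 3])
print(result)
```

Because all candidates examined for one budget share the same objective value, the first candidate that passes the tests is optimal for this simplified objective; ties are broken by enumeration order, which the paper does not prescribe.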

To test if $T^i$ is conflict-free or not in Step 5, for mapping matrices $T \in Z^{(n-1) \times n}$, the necessary and sufficient conditions in Theorem 3.1 can be used. For other mapping matrices, either these necessary conditions in Section IV or the testing procedure in [37] can be used. The computational complexity of Procedure 5.1 is estimated to be at least $\Theta((2\mu + 1)^n)$ where $\mu = \min\{\mu_i : i = 1, \ldots, n\}$. This is justified as follows. Consider one conflict vector $\gamma$ of $T$ and let $s_i$, $i = 1, \ldots, k - 1$ be the sum of the absolute values of entries of the $i$th row of space mapping matrix $S$. As indicated in [37], $|\pi_i| \ge (\max\{|\gamma_1|, \ldots, |\gamma_n|\})^{n-k} / \prod_{i=1}^{k-1} s_i$. For $\gamma$ to be feasible, at least, $\max\{|\gamma_1|, \ldots, |\gamma_n|\} \ge \mu + 1$. Let $\prod_{i=1}^{k-1} s_i = c$ be constant. Then $|\pi_i| \ge \mu^{n-k}/c$ must be satisfied. Thus, some of the possible values Procedure 5.1 at least should try for $\pi_i$, $i = 1, \ldots, n$ are $0, \pm 1, \ldots, \pm\mu$.

Because for each entry $\pi_i$, $1 \le i \le n$, there are $(2\mu + 1)$ possible values to consider and there are $n$ entries in vector $\Pi$, a lower bound for the complexity of Procedure 5.1 is



$O((2\mu + 1)^n)$. Usually, a candidate $\Pi$ whose entries $\pi_i$, $i = 1, \cdots, n$, have small absolute values, e.g., $\Pi = [1, 1, 1]$ for the matrix multiplication algorithm, does not yield a conflict-free mapping. So more sophisticated methods of finding the solution of Problem 2.2 may be possible. In general, together with Theorem 2.1, the necessary and sufficient conditions described in Theorems 3.1, 4.5, 4.6, and 4.7, depending on the dimension of the mapping matrix $T$, should be used to guide the solution search, which is under investigation. In the following, the integer linear programming approach is discussed.
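The exhaustive character of such a search can be illustrated with a minimal Python sketch. It is not Procedure 5.1 itself: it assumes a linear array (a single-row $S$), an index set equal to the box $\{0, \cdots, \mu\}^n$, $D = I$ so that $\Pi D > 0$ reduces to $\pi_i \ge 1$, and the execution time $\mu \sum_i \pi_i + 1$ used in the examples of this section. The names `is_conflict_free`, `brute_force_schedule`, and the parameter `bound` are hypothetical. Conflicts are detected directly from the definition, by checking that no two index points share the same image under $T$.

```python
from itertools import product


def is_conflict_free(T, mu):
    """Return True if no two index points of the box {0..mu}^n map to the same T*j."""
    n = len(T[0])
    seen = set()
    for j in product(range(mu + 1), repeat=n):
        image = tuple(sum(row[a] * j[a] for a in range(n)) for row in T)
        if image in seen:
            return False  # two computations share a processor and a time step
        seen.add(image)
    return True


def brute_force_schedule(S, mu, bound):
    """Try every Pi with entries in 1..bound, in order of increasing execution time.

    Assumption: D = I, so Pi*D > 0 simply means every entry of Pi is positive.
    Returns (Pi, execution_time) for the first conflict-free candidate, or None.
    """
    n = len(S)
    # sorted() is stable, so candidates with equal time stay in lexicographic order
    candidates = sorted(product(range(1, bound + 1), repeat=n), key=sum)
    for pi in candidates:
        if is_conflict_free([list(pi), list(S)], mu):
            return pi, mu * sum(pi) + 1
    return None
```

With $S = [1, 1, -1]$, $\mu = 3$, and `bound = 5`, the call `brute_force_schedule((1, 1, -1), 3, 5)` returns the schedule `(1, 2, 2)` with execution time 16. The number of candidates grows as `bound ** n`, which is why the conditions of the earlier theorems are preferable as a search guide.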

For the mapping matrix $T \in Z^{(n-1) \times n}$, Problem 2.2 can be formulated as an integer programming problem as follows.

$$\min f = \sum_{i=1}^{n} |\pi_i| \mu_i \qquad (5.1)$$

subject to

1) $\Pi D > 0$

2) $SD = PK$ and $\sum_{j} k_{ji} \le \Pi d_i$, $i = 1, \cdots, m$

3) $\exists i \in \{1, \cdots, n\}$, $|f_i(\pi_1, \cdots, \pi_n)| > \mu_i$

4) $\Pi \in Z^{1 \times n}$

(5.2)

where $T = \begin{bmatrix} \Pi \\ S \end{bmatrix}$, $S$ and $P$ are given, and $f_i$, $i = 1, \cdots, n$, are as defined in (3.2). The last two constraints required by Definition 2.2 are included implicitly in (5.2) because, by Theorems 2.2 and 3.1, they are implied by constraint 3 in (5.2). Also, constraint 2 is not required if a new processor array is designed specially for the algorithm. By Proposition 3.2, constraint 3 in (5.2) is linear. So the formulation in (5.1) and (5.2) is an integer linear programming problem if, as in Examples 5.1 and 5.2, constraint 1 in (5.2) requires that $\pi_i > 0$, $i = 1, \cdots, n$. Actually, this integer linear programming problem can be further converted to a linear programming problem for some applications, as indicated by Examples 5.1 and 5.2. This is because in these applications, the objective function is linear and all extreme points of the solution sets are integral.

One more constraint, $\gcd(f_1, \cdots, f_n) = 1$, where $f_i$, $i = 1, \cdots, n$, are as defined in (3.2), should be added to the formulation described by (5.1) and (5.2) to guarantee that the greatest common divisor of the entries of the resulting conflict vector is unity. However, this makes the problem more difficult to solve. Hence, this constraint is ignored and the feasibility of the conflict vector of the resulting solution is checked afterwards. In other words, the conflict vector may not be feasible after the common factor of its entries is removed, as indicated in Example 5.1. If all extreme points which have the minimum value of the objective function in (5.1) are not feasible, i.e., they do not contribute conflict-free mappings, then some heuristic analysis has to be used to find the optimal solution. In the Appendix, Example 5.1 illustrates this situation.

The general integer programming problem is NP-complete [29, p. 245]. However, for each fixed natural number $n$, there exists a polynomial time algorithm which solves the integer linear programming problem [29, p. 259]. The running time of this algorithm is a polynomial function of the number of constraints of the integer linear programming problem and a logarithmic function of the problem size variables $\mu_i$, $i = 1, \cdots, n$. Many existing standard algorithms can be used to solve the formulation in (5.1) and (5.2) efficiently because, in practice, the number of constraints and the algorithm dimension $n$ are not large.

Example 5.1: Consider the matrix multiplication algorithm in Example 3.1 where the space mapping matrix is given as $S = [1, 1, -1]$ [23]. The dependence matrix $D$ and index set $J$ are shown in (3.4). To satisfy condition 1 in (5.2), each entry of the linear schedule vector $\Pi$ must be positive, i.e., $\pi_i \ge 1$, $i = 1, \cdots, 3$. Therefore, the problem of finding an optimal linear schedule vector for the matrix multiplication algorithm is formulated as an integer linear programming problem:

$$\min f = \mu(\pi_1 + \pi_2 + \pi_3) \qquad (5.3)$$

subject to

1) $\pi_i \ge 1$, $i = 1, 2, 3$

2) $SD = PK$ and $\sum_{j} k_{ji} \le \Pi d_i$, $i = 1, \cdots, 3$

3) $\pi_2 + \pi_3 \ge \mu + 1$, or $\pi_1 + \pi_3 \ge \mu + 1$, or $|\pi_1 - \pi_2| \ge \mu + 1$

4) $\Pi \in Z^{1 \times 3}$

where the inequalities in constraint 3 are derived in Example 3.1 and shown in (3.5). As indicated in the Appendix, this problem can be converted to four linear programming problems. In the Appendix, techniques for solving linear programming problems are used to find the optimal solution. There are at least three potential optimal solutions: $\Pi_2 = [1, \mu, 1]$, $\Pi_3 = [\mu, 1, 1]$, and $\Pi_{11} = [1, 2, \mu - 1]$. Their corresponding conflict vectors are $\gamma_2 = [\mu + 1, -2, \mu - 1]^T$, $\gamma_3 = [2, -(\mu + 1), 1 - \mu]^T$, and $\gamma_{11} = [\mu + 1, -\mu, 1]^T$, respectively. When $\mu$ is an odd number, $\Pi_2$ and $\Pi_3$ are not feasible because their conflict vectors are not feasible. So, when $\mu$ is odd, $\Pi_{11}$ can be chosen as the optimal solution, and when $\mu$ is even, any of the three linear schedule vectors can be the optimal solution. The derivation of the solution is shown in the Appendix. Let us choose $\Pi^o = \Pi_{11} = [1, 2, \mu - 1]$. The total execution time is $t = \mu(2 + \mu) + 1$ according to (2.7). The matrix of interconnection primitives is $P = S = [1, 1, -1]$ and $K = I$, the identity matrix. Fig. 2 shows the block diagram of the linear array for multiplying two $4 \times 4$ matrices ($\mu = 3$). Fig. 3 shows the execution of the matrix multiplication algorithm by mapping matrix $T = \begin{bmatrix} 1 & 2 & 2 \\ 1 & 1 & -1 \end{bmatrix}$. Computation $c_{j_1, j_2} = c_{j_1, j_2} + a_{j_1, j_3} b_{j_3, j_2}$ indexed by $j = [j_1, j_2, j_3]^T$ is executed at processor $Sj$ and at time $\Pi j$. By inspecting Fig. 3, there are no computational conflicts. Also, as shown in Fig. 3 and explained in the Appendix, there is no link collision either, which means that no two data use the same link at the same time. As shown in Fig. 2, three data links are used: one for data $A$ traveling from left to right, one for data $B$ traveling from left to right, and one for data $C$ traveling from right to left. Two buffers are needed for each PE. One is on the link for data $A$, or for dependence vector $d_2$. The other is on the link for data $C$, or for dependence vector $d_3$. □
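The odd/even behavior of the conflict vectors in Example 5.1 can be checked numerically. The following Python sketch assumes a $2 \times 3$ mapping matrix, for which the conflict vector is (up to sign) the cross product of the two rows with the common factor removed, and it assumes the feasibility test implied by constraint 3 of (5.2): a conflict vector with relatively prime entries is feasible when some entry exceeds $\mu$ in absolute value. The function names are hypothetical.

```python
from math import gcd


def conflict_vector(pi, s):
    """Primitive integer null vector of T = [pi; s] for 3-entry rows pi and s."""
    g = (pi[1] * s[2] - pi[2] * s[1],
         pi[2] * s[0] - pi[0] * s[2],
         pi[0] * s[1] - pi[1] * s[0])
    d = gcd(gcd(abs(g[0]), abs(g[1])), abs(g[2]))
    if d == 0:
        raise ValueError("rows of T are linearly dependent")
    return tuple(x // d for x in g)  # exact division: d divides every entry


def is_feasible(gamma, mu):
    """Assumed test: some entry of the primitive conflict vector exceeds mu."""
    return any(abs(x) > mu for x in gamma)


S = (1, 1, -1)
for mu in (3, 4):
    for name, pi in (("Pi_2", (1, mu, 1)),
                     ("Pi_3", (mu, 1, 1)),
                     ("Pi_11", (1, 2, mu - 1))):
        gamma = conflict_vector(pi, S)
        print(mu, name, gamma, is_feasible(gamma, mu))
```

The signs printed by this sketch are opposite to those quoted in the text, which does not matter because a conflict vector and its negative describe the same conflicts. For $\mu = 3$ only $\Pi_{11}$ passes the test; for $\mu = 4$ all three candidates pass.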




IEEE  TRANSACTIONS  ON  PARALLEL  AND  DISTRIBUTED  SYSTEMS.  VOL  3.  NO.  3.  MAY  1992 


Fig. 2. Block diagram of the linear array for the matrix multiplication algorithm.


In [23], with the same space mapping matrix $S$, the linear schedule vector $\Pi' = [2, 1, \mu]$, which is not optimal, is used, and the corresponding conflict vector is $\gamma' = [(\mu + 1), -(2 + \mu), -1]^T$. The total execution time by $\Pi'$ is $t' = \mu(3 + \mu) + 1$, which is longer than that of the optimal linear schedule $\Pi^o$. Also, the number of buffers in [23] is … $- 1) = 3$ when $\mu = 2$. The systolic array designed in this paper only needs two buffers.

Example 5.2: Consider the transitive closure algorithm in Example 3.2 where the space mapping matrix is given as $S = [0, 0, 1]$ [22]. The dependence matrix $D$ and index set $J$ are shown in (3.6). To satisfy condition 1 in (5.2), it must have $\pi_2 > 0$, $\pi_3 > 0$, and $\pi_1 - \pi_2 - \pi_3 > 0$, which implies $\pi_1 > \pi_j > 0$, $j = 2, 3$. This means each entry of the linear schedule vector $\Pi$ must be positive. Therefore, the problem of finding an optimal linear schedule vector for the transitive closure algorithm is formulated as an integer linear programming problem:

$$\min f = \mu(\pi_1 + \pi_2 + \pi_3) \qquad (5.4)$$

subject to

1) $\pi_i \ge 1$, $i = 2, 3$, $\pi_1 - \pi_2 - \pi_3 \ge 1$, $\pi_1 - \pi_2 \ge 1$, $\pi_1 - \pi_3 \ge 1$

2) $SD = PK$ and $\sum_{j} k_{ji} \le \Pi d_i$, $i = 1, \cdots, 3$

3) $\pi_2 \ge \mu + 1$, or $\pi_1 \ge \mu + 1$

4) $\Pi \in Z^{1 \times 3}$

where constraint 3 is derived in Example 3.2 and shown in (3.7). Again, as indicated in the Appendix, this problem can be formulated as two linear programming problems. The optimal solution of this algorithm is $\Pi^o = [\mu + 1, 1, 1]$ when $\mu \ge 2$. The derivation of the solution is shown in the Appendix. The total execution time is $t = \mu(\mu + 3) + 1$ according to (2.7). As mentioned in Section I, the solution found by the heuristic procedure in [22] is $\Pi' = [2\mu + 1, 1, 1]$ (notice that the lower and upper bounds for index points are 1 and $n$ in [22] and therefore $\mu = n - 1$ in [22]), and the total execution time is $t' = \mu(2\mu + 3) + 1$, which is much longer than for $\Pi^o$. The matrix of interconnection primitives is $P = SD = [1, 0, -1, 0, -1]$ and $K = I$, the identity matrix. Clearly, there are no computational conflicts because the conflict vector of the mapping matrix $T = \begin{bmatrix} \mu + 1 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}$ is $\gamma = [1, -(\mu + 1), 0]^T$, which is feasible according to Theorem 2.2. Also, as explained in the Appendix, there is no link collision either, which means that no two data use the same link at the same time. □
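The comparison in Example 5.2 can be reproduced with a short Python sketch. It rests on several assumptions that are labeled here rather than taken from the text: the five inequalities of constraint 1 in (5.4) are encoded as coefficient rows, the conflict vector for $S = [0, 0, 1]$ is taken as $[\pi_2, -\pi_1, 0]^T$ with the common factor removed, feasibility means that some entry exceeds $\mu$ in absolute value, and the execution time is $\mu \sum_i |\pi_i| + 1$. All names are hypothetical.

```python
from math import gcd

# Coefficient rows of the inequalities in constraint 1 of (5.4); each form must be >= 1.
DEPENDENCE_FORMS = [(0, 1, 0), (0, 0, 1), (1, -1, -1), (1, -1, 0), (1, 0, -1)]


def satisfies_dependences(pi):
    """Check pi_2 >= 1, pi_3 >= 1, pi_1 - pi_2 - pi_3 >= 1, pi_1 - pi_2 >= 1, pi_1 - pi_3 >= 1."""
    return all(sum(c * p for c, p in zip(form, pi)) >= 1 for form in DEPENDENCE_FORMS)


def conflict_vector(pi):
    """Primitive null vector of T = [pi; 0 0 1], assuming positive pi_1 and pi_2."""
    g = gcd(pi[0], pi[1])
    return (pi[1] // g, -pi[0] // g, 0)


def is_feasible(gamma, mu):
    """Assumed test: some entry of the primitive conflict vector exceeds mu."""
    return any(abs(x) > mu for x in gamma)


def execution_time(pi, mu):
    """Assumed form of (2.7) for a cubic index set with bounds 0..mu."""
    return mu * sum(abs(p) for p in pi) + 1


mu = 3
for name, pi in (("optimal", (mu + 1, 1, 1)), ("heuristic of [22]", (2 * mu + 1, 1, 1))):
    gamma = conflict_vector(pi)
    print(name, pi, gamma, is_feasible(gamma, mu), execution_time(pi, mu))
```

For $\mu = 3$ the two schedules need 19 and 28 time steps, respectively. The candidate $[\mu, 1, 1]$ also satisfies the dependence inequalities but fails the feasibility test, which is why the first entry cannot be reduced below $\mu + 1$.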

VI.  Conclusions  and  Future  Work 

The first contribution of this paper is the use of the Hermite normal form of the mapping matrix, by which all conflict vectors are expressed as integer combinations of a set of vectors from the nullspace of the mapping matrix. Based on the Hermite normal form, necessary and sufficient conditions are derived for computational conflict-free mappings. The second is the method of finding optimal time mappings of $n$-dimensional algorithms into $(n-2)$-dimensional processor arrays. These techniques can be used to map algorithms with $n$ nested loops into linear or two-dimensional processor arrays with the total execution time minimized; they are especially useful for programming bit-level processor arrays.

Future work includes optimal solutions of the general mappings where $T \in Z^{k \times n}$ for arbitrary $k$, consideration of the number of buffers and the length of wires required by the mappings, and investigation of the following two problems.

Problem 6.1 (Space-optimal and conflict-free mapping problem): Given an $n$-dimensional uniform dependence algorithm and a linear schedule vector $\Pi$, find a space mapping matrix $S \in Z^{(k-1) \times n}$ such that $T = \begin{bmatrix} \Pi \\ S \end{bmatrix}$ is conflict-free and the number of processors plus the wire length of the array is minimized.

Problem 6.2 (Optimal conflict-free mapping problem): Given an $n$-dimensional uniform dependence algorithm and a $(k-1)$-dimensional processor array, find a conflict-free mapping matrix $T \in Z^{k \times n}$ such that a certain criterion is optimized.

In general, in Problem 2.2, space mapping matrix $S$ is given and usually is not a function of problem size variables $\mu_i$, $i = 1, \cdots, n$; in Problem 6.1, linear schedule vector $\Pi$ is given,




possibly by the optimization procedure proposed in [16], and usually is not a function of problem size variables; and in Problem 6.2, both $S$ and $\Pi$ are not given and possibly are both functions of problem size variables.


APPENDIX 

Discussion of Example 5.1: Let us design a new linear systolic array for the matrix multiplication algorithm. Thus, constraint 2 in (5.3) can be ignored at this moment. For an integer linear programming problem with a convex solution set, if all of its extreme points are integral, then one of the extreme points is the optimal solution of that problem [29, p. 232]. Now the solution set of the integer programming problem in (5.3) is not convex because of constraint 3, although all extreme points are integral. One way to solve this problem is to partition the solution set into four convex subsets and then to find all optimal solutions for all the solution subsets. If the one with the smallest value of the objective function is satisfactory, then it is the optimal solution of the integer programming problem in (5.3).

Now let us partition the solution set of the integer programming problem in (5.3) into four subsets, which are expressed as follows.

(I) $\min f = \mu(\pi_1 + \pi_2 + \pi_3)$

subject to

1) $\pi_i \ge 1$, $i = 1, 2, 3$

2) $\pi_2 + \pi_3 \ge \mu + 1$

3) $\pi_1 + \pi_3 \le \mu + 1$

4) $\pi_1 - \pi_2 \le \mu + 1$

5) $-\pi_1 + \pi_2 \le \mu + 1$

6) $\Pi \in Z^{1 \times 3}$

(II) $\min f = \mu(\pi_1 + \pi_2 + \pi_3)$

subject to

1) $\pi_i \ge 1$, $i = 1, 2, 3$

2) $\pi_1 + \pi_3 \ge \mu + 1$

3) $\Pi \in Z^{1 \times 3}$

(III) $\min f = \mu(\pi_1 + \pi_2 + \pi_3)$

subject to

1) $\pi_i \ge 1$, $i = 1, 2, 3$

2) $\pi_1 - \pi_2 \ge \mu + 1$

3) $\Pi \in Z^{1 \times 3}$

(IV) $\min f = \mu(\pi_1 + \pi_2 + \pi_3)$

subject to

1) $\pi_i \ge 1$, $i = 1, 2, 3$

2) $-\pi_1 + \pi_2 \ge \mu + 1$

3) $\Pi \in Z^{1 \times 3}$.

Each of the above problems is an integer linear programming problem with a convex solution set. It can be checked that every extreme point of these convex sets is integral. Each extreme point is the solution of three of the following seven equations: $\pi_1 = 1$, $\pi_2 = 1$, $\pi_3 = 1$, $\pi_2 + \pi_3 = \mu + 1$, $\pi_1 + \pi_3 = \mu + 1$, $\pi_1 - \pi_2 = \mu + 1$, and $\pi_2 - \pi_1 = \mu + 1$. There are ten such solutions from these seven equations which satisfy $\Pi D > 0$, as follows: $\Pi_1 = [1, 1, \mu]$, $\Pi_2 = [1, \mu, 1]$, $\Pi_3 = [\mu, 1, 1]$, $\Pi_4 = [1, \mu + 2, 1]$, $\Pi_5 = [\mu + 2, 1, 1]$, $\Pi_6 = [1, \mu + 2, \mu]$, $\Pi_7 = [\mu + 2, 1, \mu]$, $\Pi_8 = [\mu, \mu, 1]$, $\Pi_9 = [2\mu + 1, \mu, 1]$, and $\Pi_{10} = [\mu, 2\mu + 1, 1]$. Extreme points with the shortest execution time are $\Pi_1$, $\Pi_2$, and $\Pi_3$. $\Pi_1$ is not feasible because the corresponding conflict vector $[1, -1, 0]^T$ is not feasible. Conflict vectors for $\Pi_2$ and $\Pi_3$ are $[(\mu + 1), -2, (\mu - 1)]^T$ and $[2, -(\mu + 1), (1 - \mu)]^T$, respectively. When $\mu$ is an even number, $\Pi_2$ and $\Pi_3$ are feasible. However, when $\mu$ is odd, neither $\Pi_2$ nor $\Pi_3$ is feasible, because the entries of their conflict vectors are then all even and no entry exceeds $\mu$ once the common factor 2 is removed.

The reason why some vectors in the solution space are not feasible is as follows. In (5.3) the constraint $\gcd(f_1, \cdots, f_n) = 1$, where $f_i$, $i = 1, \cdots, n$, are as defined in (3.2), is not included because this constraint makes the problem more difficult to solve. When all extreme points with the shortest execution time do not contribute conflict-free mappings for a particular $\mu$, other solutions which are not extreme points and have the shortest execution time should be considered. However, those nonfeasible extreme points provide a lower bound on the execution time of all feasible solutions. For example, when $\mu$ is an odd number, all three extreme points with the shortest execution time, $\Pi_1$, $\Pi_2$, and $\Pi_3$, do not contribute conflict-free mappings. Because the sum of the entries of each of $\Pi_1$, $\Pi_2$, and $\Pi_3$ is $\mu + 2$, other solutions with the same execution time have the property $\pi_1 + \pi_2 + \pi_3 = 2 + \mu$. Let us consider $\Pi_{11} = [1, 2, \mu - 1]$, a solution of equations $\pi_1 = 1$ and $\pi_1 + \pi_2 + \pi_3 = \mu + 2$. The conflict vector of $\Pi_{11}$ is $\gamma_{11} = [\mu + 1, -\mu, 1]^T$. Thus, $\Pi_{11}$ is feasible and optimal because the conflict vector is feasible and it has the same execution time as the extreme points $\Pi_1$, $\Pi_2$, and $\Pi_3$, which have the shortest execution time. Therefore, for the matrix multiplication algorithm, when $\mu$ is odd, one optimal solution is $\Pi_{11}$, and when $\mu$ is even, $\Pi_2$, $\Pi_3$, and $\Pi_{11}$ could each be chosen as the optimal solution.

Let us design the linear systolic array for mapping matrix $T = \begin{bmatrix} \Pi^o \\ S \end{bmatrix} = \begin{bmatrix} 1 & 2 & 2 \\ 1 & 1 & -1 \end{bmatrix}$ where $\mu = 3$. If $P = [1, 1, -1]$ is chosen as the matrix of interconnection primitives and $K = I$ (the identity matrix), then $SD = PK$, $\sum_{j=1}^{3} k_{j1} = 1 = \Pi d_1 = 1$, $\sum_{j=1}^{3} k_{j2} = 1 < \Pi d_2 = 2$, $\sum_{j=1}^{3} k_{j3} = 1 < \Pi d_3 = 2$, and therefore constraint 2 in (5.3) is satisfied. Because $\Pi d_2 - \sum_{j=1}^{3} k_{j2} + \Pi d_3 - \sum_{j=1}^{3} k_{j3} = 2$, two buffers are needed, one for the link of data $A$ induced by dependence $d_2$ and the other for the link of data $C$ induced by dependence $d_3$. The systolic structure and the execution are shown in Figs. 2 and 3, respectively.
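The buffer count used above can be written as a small Python sketch. It assumes that the number of buffers is the sum over all dependence vectors $d_i$ of $\Pi d_i - \sum_j k_{ji}$, as in the computation just shown, and that $D = I$ for the matrix multiplication algorithm (consistent with $P = SD = S$). The function name `buffer_count` is hypothetical.

```python
def buffer_count(pi, D, K):
    """Sum over dependences i of (Pi * d_i - sum_j k_ji).

    pi: schedule vector (length n); D: n x m dependence matrix (columns d_i);
    K: matrix whose column i gives the link usage of dependence i.
    """
    n, m = len(D), len(D[0])
    total = 0
    for i in range(m):
        delay = sum(pi[r] * D[r][i] for r in range(n))  # time allowed by the schedule
        hops = sum(K[j][i] for j in range(len(K)))      # time spent travelling on links
        total += delay - hops
    return total


IDENTITY = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

# Schedule of Example 5.1 for mu = 3, with D = I (assumed) and K = I.
print(buffer_count([1, 2, 2], IDENTITY, IDENTITY))
```

Under these assumptions the schedule $\Pi^o = [1, 2, \mu - 1]$ needs $\mu - 1$ buffers, which gives the two buffers quoted for $\mu = 3$.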

Notice that there is no data link collision because in every column and every row of matrix $K$ there is only one nonzero entry $k_{ii} = 1$, $i = 1, \cdots, 3$. This means that when data pass from the source to the destination, they use the data link just once (one hop between two PE's). Data link collisions occur only if data use links more than once when passing from the source to the destination. For example, if the space mapping is $S' = [1, 1, \mu]$ and $P' = [1, 1, 1]$, then to satisfy the condition $SD = PK$, one possible set of values for $K$ is $k_{11} = k_{22} = 1$, $k_{33} = \mu$, and $k_{ij} = 0$, $i \ne j$. Thus, the distance between the source and destination for data $C$ is $\mu$ PE's, and data $C$ will take $\mu$ hops over the third link in $P$, or the link for $C$, to reach the destination. Suppose $PE_i$, $i = 1, \cdots, \mu$, are sending data $x_{ij}$ (corresponding to $c_{ij}$ of matrix $C$) to $PE_{i+\mu}$ at time $t_j$, $j = 1, \cdots, \mu$. Then at time $t_1$, $x_{11}$ is on the link between $PE_1$ and $PE_2$. At time $t_2$, two pieces of data $x_{11}$ and $x_{22}$ are on the link between $PE_2$ and $PE_3$, and so on. At time $t_{\mu-1}$, $\mu - 1$ pieces of data $x_{11}, x_{22}, \cdots, x_{\mu-1,\mu-1}$




are on the link between $PE_{\mu-1}$ and $PE_\mu$. So, link collisions exist after time $t_1$. This is caused by $k_{33} = \mu$.
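The collision scenario described above can be simulated with a minimal Python sketch. The timing model is an assumption made for illustration: $PE_i$ emits datum $x_{ij}$ at time step $j$, and every datum advances one link per time step for a given number of hops. The names `link_loads` and `hops` are hypothetical.

```python
from collections import Counter


def link_loads(mu, hops):
    """Count how many data occupy each (link, time) pair.

    Link p is the link from PE_p to PE_{p+1}. PE_i (i = 1..mu) emits one datum
    at each time j = 1..mu; a datum is on link i + h at time j + h for h < hops.
    """
    load = Counter()
    for i in range(1, mu + 1):
        for j in range(1, mu + 1):
            for h in range(hops):
                load[(i + h, j + h)] += 1
    return load


mu = 4
one_hop = link_loads(mu, 1)     # K = I: each datum uses a link once
many_hops = link_loads(mu, mu)  # k_33 = mu: each datum uses mu consecutive links
print(max(one_hop.values()), max(many_hops.values()))
```

With one hop per datum, no link ever carries more than one datum at a time. With $\mu$ hops, the link between $PE_2$ and $PE_3$ already carries two data at time 2, and the link between $PE_{\mu-1}$ and $PE_\mu$ carries $\mu - 1$ data at time $\mu - 1$, in agreement with the description above.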

Discussion of Example 5.2: Similarly, the integer programming problem in (5.4) can be converted to the following two integer linear programming problems.

(I) $\min f = \mu(\pi_1 + \pi_2 + \pi_3)$

subject to

1) $\pi_i \ge 1$, $i = 2, 3$, $\pi_1 - \pi_2 - \pi_3 \ge 1$, $\pi_1 - \pi_2 \ge 1$, $\pi_1 - \pi_3 \ge 1$

2) $\pi_2 \ge \mu + 1$

3) $\pi_1 \le \mu + 1$

4) $\Pi \in Z^{1 \times 3}$

(II) $\min f = \mu(\pi_1 + \pi_2 + \pi_3)$

subject to

1) $\pi_i \ge 1$, $i = 2, 3$, $\pi_1 - \pi_2 - \pi_3 \ge 1$, $\pi_1 - \pi_2 \ge 1$, $\pi_1 - \pi_3 \ge 1$

2) $\pi_1 \ge \mu + 1$

3) $\Pi \in Z^{1 \times 3}$.

(A.2)

Again, because all extreme points of the convex solution subsets are integral, techniques for linear programming can be applied, which check all extreme points of the solution subsets. One of the extreme points with the minimum value of the objective function $f$ is the optimal one. Again, each extreme point is a solution of three of the following seven equations: $\pi_2 = 1$, $\pi_3 = 1$, $\pi_1 - \pi_2 - \pi_3 = 1$, $\pi_1 - \pi_2 = 1$, $\pi_1 - \pi_3 = 1$, $\pi_2 = \mu + 1$, and $\pi_1 = \mu + 1$. Four such extreme points are $\Pi_1 = [\mu + 1, 1, 1]$, $\Pi_2 = [\mu + 1, 1, \mu - 1]$, $\Pi_3 = [\mu + 1, 1, \mu]$, and $\Pi_4 = [\mu + 1, \mu - 1, 1]$. When $\mu \ge 2$, all the above linear schedule vectors satisfy constraint $\Pi D > 0$. The corresponding conflict vectors, according to (3.2), are $\gamma_1 = [1, -(\mu + 1), 0]^T$, $\gamma_2 = [1, -(\mu + 1), 0]^T$, $\gamma_3 = [1, -(\mu + 1), 0]^T$, and $\gamma_4 = [\mu - 1, -(\mu + 1), 0]^T$. Vectors $\gamma_i$, $i = 1, \cdots, 3$, are feasible and $\gamma_4$ is feasible if $\mu$ is an even number. The extreme point with the minimum value of the objective function $f$ is $\Pi^o = \Pi_1 = [\mu + 1, 1, 1]$, and the total execution time by $\Pi_1$ is $\mu(\mu + 3) + 1$ according to (2.7).

If the matrix of interconnection primitives $P = SD = [1, 0, -1, 0, -1]$ is chosen and $K = I$, the identity matrix, then constraint 2 in (5.4) is satisfied. Similar to the design for the matrix multiplication algorithm, there is no data link collision because in every column and every row of matrix $K$ there is only one nonzero entry $k_{ii} = 1$, $i = 1, \cdots, 3$.


Acknowledgment

We would like to thank one referee for ten pages of constructive and very helpful comments and Y. X. Wang for drawing the figures in this paper.

References 

[1] R. M. Karp, R. E. Miller, and S. Winograd, "The organization of computations for uniform recurrence equations," J. ACM, vol. 14, no. 3, pp. 563-590, July 1967.

[2] D. I. Moldovan and J. A. B. Fortes, "Partitioning and mapping algorithms into fixed size systolic arrays," IEEE Trans. Comput., vol. C-35, pp. 1-12, Jan. 1986.

[3] P. R. Cappello and K. Steiglitz, "Unifying VLSI array designs with geometric transformations," in Proc. 1983 Int. Conf. Parallel Processing, pp. 448-457.

[4] P. Quinton, "Automatic synthesis of systolic arrays from uniform recurrent equations," in Proc. 11th Annu. Symp. Comput. Architecture, 1984, pp. 208-214.

[5] S. K. Rao, "Regular iterative algorithms and their implementations on processor arrays," Ph.D. dissertation, Stanford Univ., Stanford, CA, Oct. 1985.

[6] M. Chen, "A design methodology for synthesizing parallel algorithms and architectures," J. Parallel Distributed Comput., Dec. 1986, pp. 461-491.

[7] J.-M. Delosme and I. C. F. Ipsen, "An illustration of a methodology for the construction of efficient systolic architectures in VLSI," in Proc. Second Int. Symp. VLSI Technol., Syst. Appl., 1985, pp. 268-273.

[8] S. Y. Kung, VLSI Array Processors. Englewood Cliffs, NJ: Prentice-Hall, 1987.

[9] C. Guerra and R. Melhem, "Synthesizing non-uniform systolic designs," in Proc. 1986 Int. Conf. Parallel Processing, pp. 765-771.

[10] G.-J. Li and B. W. Wah, "The design of optimal systolic arrays," IEEE Trans. Comput., vol. C-34, pp. 66-77, Jan. 1985.

[11] M. T. O'Keefe and J. A. B. Fortes, "A comparative study of two systematic design methodologies for systolic arrays," in Proc. 1986 Int. Conf. Parallel Processing, pp. 672-675.

[12] J. A. B. Fortes and F. Parisi-Presicce, "Optimal linear schedule for the parallel execution of algorithms," in Proc. 1984 Int. Conf. Parallel Processing, pp. 322-328.

[13] L. Lamport, "The parallel execution of DO loops," Commun. ACM, vol. 17, no. 2, pp. 83-93, Feb. 1974.

[14] J.-K. Peir and R. Cytron, "Minimum distance: A method for partitioning recurrences for multiprocessors," in Proc. 1987 Int. Conf. Parallel Processing, pp. 217-225.

[15] V. P. Roychowdhury and T. Kailath, "Subspace scheduling and parallel implementation of non-systolic regular iterative algorithms," J. VLSI Signal Processing, vol. 1, 1989.

[16] W. Shang and J. A. B. Fortes, "Time optimal linear schedules for algorithms with uniform dependencies," IEEE Trans. Comput., vol. 40, pp. 723-742, June 1991.

[17] S. Y. Kung, S. C. Lo, and P. S. Lewis, "Optimal systolic design for the transitive closure and the shortest path problems," IEEE Trans. Comput., vol. C-36, pp. 603-614, May 1987.

[18] G. Strang, Linear Algebra and its Applications, second ed. New York: Academic, 1980.

[19] R. Cytron, "Doacross: Beyond vectorization for multiprocessors (extended abstract)," in Proc. 1986 Int. Conf. Parallel Processing, pp. 836-844.

[20] R. Kannan and A. Bachem, "Polynomial algorithms for computing the Smith and Hermite normal forms of an integer matrix," SIAM J. Comput., vol. 8, no. 4, pp. 499-507, Nov. 1979.

[21] H. T. Kung and C. E. Leiserson, "Algorithms for VLSI array processors," in Introduction to VLSI Systems, C. Mead and L. Conway, Eds. Reading, MA: Addison-Wesley, 1980, sect. 8.3.

[22] P. Lee and Z. M. Kedem, "Mapping nested loop algorithms into multidimensional systolic arrays," IEEE Trans. Parallel Distributed Syst., vol. 1, pp. 64-76, Jan. 1990.

[23] P. Lee and Z. M. Kedem, "Synthesizing linear array algorithms from nested for loop algorithms," IEEE Trans. Comput., vol. 37, pp. 1578-1598, Dec. 1988.

[24] D. A. Padua, "Multiprocessors: Discussion of theoretical and practical problems," Ph.D. dissertation, Univ. of Illinois at Urbana-Champaign, Rep. UIUCDCS-R-79-990, Nov. 1979.

[25] Y. Wong and J.-M. Delosme, "Optimal systolic implementation of N-dimensional recurrences," in IEEE Proc. ICCD, 1985, pp. 618-621.

[26] V. E. Taylor and J. A. B. Fortes, "Using RAB to map algorithms into bit-level systolic arrays," in Proc. Int. Conf. Supercomput., May 1987.

[27] L. J. Mordell, Diophantine Equations. New York: Academic, 1969, p. 30.

[28] M. T. O'Keefe and J. A. B. Fortes, "Bit level processor array: Current architectures and a design and a programming tool," in Proc. 1988 Int. Symp. Circuit Syst., Helsinki, Finland, June 1988, pp. 2751-2755.

[29] A. Schrijver, Theory of Linear and Integer Programming. New York: Wiley, 1986.

[30] W. Shang, "Scheduling, partitioning and mapping of uniform dependence algorithms on processor arrays," Ph.D. dissertation, Purdue Univ., W. Lafayette, IN 47907, May 1990.

[31] T. Blank, "The MasPar MP-1 architecture," in Proc. 35th IEEE Comput. Soc. Int. Conf., Spring Compcon 90, San Francisco, CA, Feb. 1990.

[32] D. I. Moldovan, "On the design of algorithms for VLSI systolic arrays," Proc. IEEE, vol. 71, pp. 113-120, Jan. 1983.

[33] R. Davis and D. Thomas, "Systolic array chip matches the pace of high-speed processing," Electron. Design, Oct. 31, 1984.

[34] R. W. Hockney and C. R. Jesshope, Parallel Computers: Architecture, Programming and Algorithms. Bristol: Adam Hilger, 1981, pp. 178-192.

[35] K. E. Batcher, "Bit-serial parallel processing systems," IEEE Trans. Comput., vol. C-31, no. 5, pp. 377-384.

[36] J. A. B. Fortes and B. W. Wah, "Systolic array - From concept to implementation," IEEE Comput. Mag., pp. 12-17, July 1987.

[37] Z. Yang, W. Shang, and J. A. B. Fortes, "One-to-one time mappings of nested algorithms into lower dimensional processor arrays," in Proc. Sixth IEEE Int. Parallel Processing Symp., Mar. 1992, Beverly Hills, CA, pp. 156-164.


Weijia Shang (S'88-M'90) received the B.S. degree in electrical and computer engineering from Changsha Institute of Technology, China, in 1982 and the M.S. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, in 1984 and 1990, respectively.

She is currently an Assistant Professor in the Center for Advanced Computer Studies, the University of Southwestern Louisiana, Lafayette. Her research interests include parallel processing, computer architecture, algorithm transformation, processor array programming, special purpose VLSI bit-level processor array design, and optimizing compiler techniques.

Dr. Shang is a member of the Association for Computing Machinery.

Jose A. B. Fortes (S'80-M'83) received the Licenciatura em Engenharia Electrotecnica from the Universidade de Angola in 1978, the M.S. degree in electrical engineering from the Colorado State University, Fort Collins, in 1981, and the Ph.D. degree in electrical engineering from the University of Southern California, Los Angeles, in 1984.

In 1984, he joined the faculty of the School of Electrical Engineering, Purdue University, West Lafayette, IN, where he currently is an Associate Professor. From July 1989 to July 1990 he served at the National Science Foundation as program director for microelectronics systems architecture. His research interests are in the areas of parallel processing, fault-tolerant computing, and VLSI computer architecture, on which he has coauthored over 50 technical papers. His research has been funded by the Office of Naval Research, AT&T Foundation, General Electric, and the National Science Foundation.

Dr. Fortes is a member of the IEEE Professional Society. He is on the Editorial Boards of the Journal of Parallel and Distributed Computing and the Journal of VLSI Signal Processing.




REFERENCE  NO.  7 


Chean, M., and Fortes, J. A. B., "The Full-Use-of-Suitable-Spares (FUSS) Approach to Hardware Reconfiguration for Fault-Tolerant Processor Arrays," IEEE Transactions on Computers, Vol. 39, No. 4, April 1990, pp. 564-571.

Note - This paper describes a new technique for processor array reconfiguration and discusses its performance. With a probability of more than 98%, it allows a processor array to reconfigure in the presence of up to as many faults as the number of spare processors available.


IEEE TRANSACTIONS ON COMPUTERS, VOL. 39, NO. 4, APRIL 1990


Fig. 6. The expected diagnosis time of two k-out-of-16 structures. (a) A k-out-of-16 system ($p_i$: 0.5-0.7). (b) A k-out-of-16 system ($p_i$: 0.71-0.99).

Fig. 7. k-out-of-256 structures with different $T_f$, $p_i$: (0.5, 0.7). (a) $T_r = 1$, $T_f = 0.5$. (b) $T_r = 1$, $T_f = 0.25$.


vidual units and the expected time for testing each unit is known. A proof of the optimality of a diagnosis algorithm was given for k-out-of-n structures along with a method of compactly representing an optimal solution. Simulation results illustrate the reduction in diagnosis time achieved by exploiting knowledge of the repair strategy as well as knowledge of the expected yield and test time of each unit in the structure.



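The Full-Use-of-Suitable-Spares (FUSS) Approach to Hardware
Reconfiguration for Fault-Tolerant Processor Arrays

MENGLY CHEAN and JOSÉ A. B. FORTES

Abstract - A general approach to hardware reconfiguration is proposed
for VLSI/WSI fault-tolerant processor arrays. The technique, called
FUSS (full use of suitable spares), uses an indicator vector, the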

Manuscript received July 2, 1989; revised November 30, 1989. This work
was supported in part by the Innovative Science and Technology Office of the
Strategic Defense Initiative Organization and was administered through the
Office of Naval Research under Contracts 00014-[illegible] and 00014-88-k-0723.

M. Chean is with [illegible], Houston, TX 77023.

J. A. B. Fortes is with the School of Electrical Engineering, Purdue University,
West Lafayette, IN 47907.

IEEE Log Number 8933877.

0018-9340/90/0400-0564$01.00 © 1990 IEEE




surplus vector, to guide the replacement of faulty processors within an
array. Analytical study of the general FUSS algorithm shows that there is
a linear relationship between the array size and the area of interconnect
required for reconfiguration to be 100% successful.

In an instance of FUSS, called simple FUSS, reconfiguration is done
by simply "shifting" up or down faulty processors along their
corresponding columns according to the surplus vector's entries. The surplus
vector is progressively updated after each column is reconfigured. The
reconfiguration is successful when the surplus vector becomes null.
Simple FUSS is discussed in detail and evaluated. Simulations show that
when the number of faulty processors is equal to that of spare
processors, simple FUSS can achieve a probability of survival as high as 99%;
this is far better than the probability of survival of existing schemes with
similar complexity in execution time and interconnection hardware.


Fig. 1. FUSS FTPA model. (a) FUSS FTPA. (b) FUSS lower support-domain $\Lambda_l$.


Index Terms - Fault-tolerant, reconfiguration, redundancy, reliability,
survivability.

I.  Introduction 

A difficult problem faced in the design of fault-tolerant processor
arrays is how to reconfigure hardware in order to dynamically replace
faulty processors with functional spares. This paper presents a
general and ideal approach, as well as a practical instance of it, that
guarantees successful reconfiguration with higher probability than
presently known approaches of comparable space-time complexity.
For the proposed scheme, called FUSS (full use of suitable spares),
basic reconfiguration and interconnection requirements, and probability
of survival are presented and discussed.

Many reconfiguration schemes (see [1] for a recent discussion and
classification of extant approaches) fail to use advance knowledge
of the distribution of faulty processors in the array. As the algorithm
proceeds, faulty processors are replaced locally. Thus, spare
processors that have not been used may not be available for replacing
faulty processors that the algorithm encounters later. This results
in low probability of survival (the conditional probability that the
reconfiguration is successful given that the array has a number of
faulty processors), especially when the number of faulty processors
approaches the number of spares. FUSS uses an indicator vector,
the surplus vector, to guide the replacement of faulty processors
within an array. This vector has as many entries as the number of
rows of the array. The initial value of entry $i$ is the sum of spares in
the subarray consisting of rows 1 through $i$, minus the sum of faulty
processors in the subarray. Based on the entries of the surplus vector,
reconfiguration is accomplished by "shifting" faulty processors from
regions having the most faulty processors to regions having fewer
faulty processors. The surplus vector is progressively updated during
shifting, and reconfiguration is successful when the surplus vector
becomes null.

General FUSS has, in its most general and ideal case, 100% probability
of survival. The general FUSS algorithm is studied (Section
II), and then a practical instance of FUSS, called simple FUSS, is
discussed in detail and evaluated (Section III). Simulations show that
simple FUSS can achieve a probability of survival as high as 99%
at the maximum number of faults allowable in an array (i.e., when
there are as many faulty processors as there are spares); this is far
better than the probability of survival of existing schemes with similar
complexity in execution time and redundant hardware (see [2],
[5], [6], and references therein).

II. FUSS Approach to Array Reconfiguration
A.  Definitions  and  Notations 

In FUSS, the array is assumed to be $M \times (N + C)$, where $M$ is
the total number of rows, $N$ is the total number of columns, and $C$
is the total number of spare columns. Fig. 1(a) shows such an array.

A fault-tolerant processor array (FTPA) is a two-dimensional
array of identical and regularly interconnected processing elements
in which redundant circuitry and spares are incorporated. A physical
FTPA is a nonreconfigured FTPA which may contain faulty
elements. A logical FTPA, on the other hand, is a reconfigured
FTPA that is fully operational; faulty physical processors have been
replaced by operational spares.

A cell is a processing element (PE) of an FTPA. The notation
$(i, j)$ refers to the cell with physical coordinates $i$, $j$. The cell with
logical coordinates $i$, $j$ is referred to as $[i, j]$. A reconfiguration
mapping $\phi$ is a function that maps a logical cell to a physical cell,
e.g., $\phi([i, j]) = (x, y)$. The inverse of $\phi$ maps a functional physical
cell to a logical cell, e.g., $\phi^{-1}((i, j)) = [m, n]$. Also, $\phi_i(\cdot)$
and $\phi_j(\cdot)$ map a logical cell to the corresponding physical row
and the corresponding physical column, respectively. For example,
$\phi_i([m, n]) = x$ if logical cell $[m, n]$ is on physical row $x$. The
functions $\phi_i^{-1}(\cdot)$ and $\phi_j^{-1}(\cdot)$ are defined similarly.

An available cell is a good, functional cell, whereas an unavailable
cell refers to either a faulty cell or a fault-free cell that is already
used to replace another cell. Shift refers to logical replacement of
an unavailable cell by an available one. Cell $(i, j)$ is shifted to cell
$(m, n)$ if $(m, n)$ replaces $(i, j)$.

A status matrix is an $M \times (N + C)$ matrix whose entries represent
the status of cells of an array. In a physical FTPA, the status of a cell
is either 0 (fault-free) or 1 (faulty). In a logical FTPA, the status of
a cell may also include a higher number (2, 3, ...) to indicate that it
replaces another cell.

A support-domain $\Lambda$ of a cell is a set of cells called supporters,
each of which may replace the cell when it becomes unavailable. Two
symmetrical support-domains are defined for each cell in the physical
array. Fig. 1(b) shows the lower support-domain, denoted by $\Lambda_l$;
the upper support-domain ($\Lambda_u$) is similar except that it is the mirror
image of the one shown. A support-domain has a triangular shape
(except near the array boundary) such that the Manhattan distance (see footnote 1)
between the cell and any of its farthest supporters is the same; this
distance, denoted by $\delta$, is referred to as support radius.

The connection window $W$ of a cell is the set of all physical cells
that can be logical neighbors to the cell. Two connection windows are
associated with a cell, horizontal window $W_H$ and vertical window
$W_V$. The size of a window is the number of cells it contains.

The probability of survival or survivability $\sigma$ is the conditional
probability of success in reconfiguring an array given that the array
has a number of faulty cells. The demand-spare $\rho$ is the ratio of
the number of faulty cells to the total number of spares in the array
(i.e., the demand for spares normalized with respect to the number
of spares).

Let $B = [b_{ij}]$ be the status matrix of a physical FTPA; i.e.,
$b_{ij} = 1$ if $(i, j)$ is faulty, and $b_{ij} = 0$ if $(i, j)$ is fault-free. The
following is a list of additional symbols used throughout the paper
and their meaning.

$T$  An $M \times M$ lower triangular matrix $T = [t_{ij}]$ such that
$t_{ij} = 0$ if $i < j$, and $t_{ij} = 1$ otherwise.

$u$  Unit vector; $u = [1, 1, \ldots, 1]^T$.

$n$  Integer vector; $n = [1, 2, \ldots, M]^T$.

$f$  Fault vector; $f = [f_1, f_2, \ldots, f_M]^T$, $f_i = \sum_{j=1}^{N+C} b_{ij}$.

$F$  Total number of faulty cells in the array; $F = \sum_{i=1}^{M} f_i$.

$S$  Total number of spare cells in the array; $S = MC$. The
demand-spare defined previously is given by $\rho = F/S$.

$s$  Surplus vector; $s = [s_1, s_2, \ldots, s_M]^T$, where $s_i$ represents
the accumulated surplus in the subarray consisting of rows 1
through $i$ (total number of spares in the subarray minus total
number of faulty cells in the subarray); see (2.3) below.

1 The distance between cells, say $(i, j)$ and $(k, l)$, is defined as the
Manhattan distance between them, i.e.,

$$d((i, j), (k, l)) = |i - k| + |j - l|.$$

The unit is assumed to be the cell side, the cell being a square.

All cells are assumed identical and self-testing; faults are assumed
independent and uniformly distributed in the physical array.

B. Basic Concepts

The fault vector and the surplus vector can be written, respectively, as

$$f = Bu \tag{2.1}$$

$$s = Cn - Tf. \tag{2.2}$$

Each entry $s_i$ in $s$ indicates the accumulated surplus in the subarray
consisting of rows 1 through $i$:

$$s_i = \sum_{j=1}^{i} (C - f_j) = C \cdot i - \sum_{j=1}^{i} f_j. \tag{2.3}$$

a) If $s_i > 0$, then $C \cdot i > \sum_{j=1}^{i} \left( \sum_{k=1}^{N+C} b_{jk} \right)$. This means that the
sum of spares in rows 1 through $i$ is greater than the number of faulty
cells in rows 1 through $i$; thus, row $i$ has extra cells available for use
by faulty cells in rows $i + 1, i + 2, \ldots, M$.

b) If $s_i < 0$, then $C \cdot i < \sum_{j=1}^{i} \left( \sum_{k=1}^{N+C} b_{jk} \right)$; row $i$ has a
"cell-deficit" and needs to use available cells from rows $i + 1, i + 2, \ldots, M$.

c) If $s_M < 0$, then the total number of spares is less than the total
number of faults. In this case, the array is not reconfigurable and the
algorithm must exit with a failure.

The basic strategy of the proposed reconfiguration scheme is to
shift cells around according to these needs; for instance, unavailable
cells are shifted to row $i$ if $s_i > 0$, or out of row $i$ if $s_i < 0$. The
generation of fault vector and surplus vector is shown in Fig. 2 for
a $5 \times (5 + 2)$ physical FTPA ($M = N = 5$, $C = 2$).
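As an illustration of (2.1)-(2.3), the following minimal Python sketch computes the fault vector and the surplus vector for the status matrix read from Fig. 2. The function names `fault_and_surplus` and `is_reconfigurable` are hypothetical, introduced only for this example; the running sum plays the role of the product $Tf$.

```python
def fault_and_surplus(B, C):
    """Return (f, s) for a 0/1 status matrix B with C spare columns."""
    f = [sum(row) for row in B]           # f = B u: number of faults per row, eq. (2.1)
    s, faults_so_far = [], 0
    for i, f_i in enumerate(f, start=1):
        faults_so_far += f_i              # (T f)_i: faults in rows 1..i
        s.append(C * i - faults_so_far)   # s_i = C*i - sum_{j<=i} f_j, eq. (2.3)
    return f, s


def is_reconfigurable(s):
    """Case c): the array cannot be reconfigured when s_M < 0."""
    return s[-1] >= 0


# Status matrix of the 5 x (5 + 2) array of Fig. 2 (M = N = 5, C = 2).
B_FIG2 = [
    [0, 0, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [1, 0, 0, 1, 0, 1, 0],
    [0, 1, 0, 0, 0, 0, 0],
]

f, s = fault_and_surplus(B_FIG2, C=2)
print(f)  # faults per row
print(s)  # accumulated surplus per row
```

For this matrix the sketch gives $f = [3, 0, 0, 3, 1]^T$ and $s = [-1, 1, 3, 2, 3]^T$, the surplus vector used in the normalization example of Section II-C.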

C.  General  FUSS  Algorithm 

The  general  FUSS  algorithm  accepts  a  physical  FTPA  as  the  input, 
generates  a  logical  FTPA  as  its  output,  and  can  be  divided  into  four 
parts  which  execute  in  the  following  order  1)  array  preprocessing,  2) 
surplus  normalization.  3)  fault  shifting,  and  4)  cell  interconnection. 
This  subdivision  is  done  for  clarity  as  well  as  for  monitoring  the  algo¬ 
rithm  progress,  far  instance,  the  reconfiguration  may  be  successful 
when  “fault  shifting”  is  completed;  however,  it  may  fail  when  “cell 
interconnection”  is  applied  if  the  provided  interconnection  network: 
is  restricted,  far  this  reason,  the  study  of  the  effect  of  limited  in¬ 
terconnect  on  the  algorithm  can  be  easily  done.  A  flowchart  of  the 
general  FUSS  algorithm  is  provided  in  Fig.  3. 

Array Preprocessing: The array preprocessing consists of generating
fault vector $f$ and surplus vector $s$, and, based on the value
of $s_M$, deciding to abort or continue the reconfiguration. It can be
summarized as follows:

1) Generate $f$ and $s$ using (2.1) and (2.2).

2) If $s_M < 0$, exit with failure.

Surplus Normalization: If $s_M > 0$, $s_M$ extra cells in row $M$
are available for use by "imaginary" faulty cells in nonexistent rows
$M + 1, M + 2, \ldots$. Shifting these faulty cells to available cells in row
$M$ would be required by the rules used in the fault-shifting phase of
FUSS. A better solution is to make $s_M = 0$, which is one of the
objectives of the surplus normalization. A second reason applies to
FTPA's with a number of faulty cells much smaller than the number
of spares. For instance, if in a surplus vector there exists a set of
positive entries with increasing values, each row of FTPA corresponding
to each entry of this set must have a number of faulty cells less than
the number of spare cells. This means that shifting faulty cells in


B =
    0 0 1 0 0 1 1
    0 0 0 0 0 0 0
    0 0 0 0 0 0 0
    1 0 0 1 0 1 0
    0 1 0 0 0 0 0

f = Bu = [3, 0, 0, 3, 1]^T

s = Cn - Tf = 2 [1, 2, 3, 4, 5]^T - T [3, 0, 0, 3, 1]^T = [-1, 1, 3, 2, 3]^T

Fig. 2. Example of generation of fault vector f and surplus vector s for a
physical FTPA with status matrix B.


Fig.  3.  General  FUSS  flowchart. 

these rows may not be needed. In general, unnecessary shifting may
result from large accumulated surplus. Thus, surplus normalization is
preferred to minimize the number of shiftings required by the actual
faults.

Surplus normalization is outlined below. The notation $S_{i/j}$ refers
to an ordered list with elements $(s_i, s_{i+1}, \ldots, s_j)$, $i \le j$.

1) Given $s = [s_1, s_2, \ldots, s_M]^T$, $s_M \ge 0$, set $I = 1$.

2) Let $S_{I/M} = (s_I, s_{I+1}, \ldots, s_M)$.

3) Partition $S_{I/M}$ into two disjoint lists:

$S_{I/I+k} = (s_I, s_{I+1}, \ldots, s_{I+k})$, $I + k < M$ and $s_i \le 0$, $i = I, I + 1, \ldots, I + k$;

$S_{I+k+1/M} = (s_{I+k+1}, s_{I+k+2}, \ldots, s_M)$, $s_{I+k+1} > 0$.

4) Partition $S_{I+k+1/M}$ into two disjoint lists (one of which may be
empty):

$S_{I+k+1/l} = (s_{I+k+1}, s_{I+k+2}, \ldots, s_{l-1}, s_l)$, $l \le M$,
$s_{i-1} \le s_i$, $i = I + k + 2, I + k + 3, \ldots, l - 1$, and $s_{l-1} > s_l$
($S_{I+k+1/l-1}$ has elements with an increasing order of values);

$S_{l+1/M} = (s_{l+1}, s_{l+2}, \ldots, s_M)$.

5) If $S_{I+k+1/l} = S_{I+k+1/M}$ (i.e., there are no two consecutive
elements $s_i, s_{i+1} \in S_{I+k+1/M}$ such that $s_i > s_{i+1}$), then set $s_i = 0$,
$\forall s_i \in S_{I+k+1/M}$; go to 8).

6) Otherwise,

a) offset $\leftarrow \min(s_l, s_M)$ if $s_l \ge 0$, and offset $\leftarrow 0$ if $s_l < 0$;

b) $s_i \leftarrow s_i -$ offset, $\forall s_i \in S_{I+k+1/M}$;

c) $\forall s_i \in S_{I+k+1/l}$, if $s_i < 0$ then reassign $s_i \leftarrow 0$.

7) If $s_M > 0$, $I \leftarrow l + 1$ and go to 2).

8) Return ($s \leftarrow S_{1/M}$).

To illustrate the surplus normalization, consider the surplus vector
obtained in the example shown in Fig. 2. The first iteration through the
surplus-normalization procedure gives $S_{I/M} = (-1, 1, 3, 2, 3)$;
$S_{I/I+k} = (-1)$; $S_{I+k+1/M} = (1, 3, 2, 3)$. Step 4) yields $S_{I+k+1/l} =
(1, 3, 2)$ and $S_{l+1/M} = (3)$ (note that $k = 0$ and $l = 4$). In step 6), a)
offset = 2; b) $S_{I+k+1/M} = (-1, 1, 0, 1)$; c) $S_{I+k+1/M} = (0, 1, 0, 1)$
("-1," the first element, is reset to 0 since it is also in $S_{I+k+1/l}$). The
second iteration reassigns $s_5$ ($M = 5$) to zero in step 5). The normalized
surplus is then $s = [-1, 0, 1, 0, 0]^T$. As will be shown next (in
fault shifting), based on the normalized surplus vector, the number
of shifts will be $|-1| + 1 = 2$ as opposed to $|-1| + 1 + 3 + 2 + 3 = 10$
if the nonnormalized surplus vector were used.
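The worked example can be replayed with a short Python sketch. This is one reading of the normalization steps rather than the authors' code: it assumes the offset in step 6a) is the smaller of the value at the first dip and $s_M$ (zero when the dip is negative), that the next iteration restarts just after the dip, and that only entries before the dip are clamped in step 6c), so that a genuine deficit at the dip is never erased. The name `normalize_surplus` is hypothetical.

```python
def normalize_surplus(s):
    """Normalize a surplus vector (list of ints, s[-1] >= 0); returns a new list."""
    s = list(s)
    M = len(s)
    if s[-1] < 0:
        raise ValueError("s_M < 0: the array is not reconfigurable")
    start = 0                                   # 0-based counterpart of I
    while start < M:
        # Step 3: skip the leading run of nonpositive entries.
        p = start
        while p < M and s[p] <= 0:
            p += 1
        if p == M:
            break                               # no positive entries remain
        # Step 4: find the first dip, i.e. the first l with s[l] < s[l-1].
        l = p + 1
        while l < M and s[l] >= s[l - 1]:
            l += 1
        if l == M:
            # Step 5: the positive tail never decreases, so it is pure excess.
            for i in range(p, M):
                s[i] = 0
            break
        # Step 6: remove the surplus that survives past the dip.
        offset = min(s[l], s[-1]) if s[l] >= 0 else 0
        for i in range(p, M):
            s[i] -= offset
        for i in range(p, l):                   # assumption: clamp only before the dip
            if s[i] < 0:
                s[i] = 0
        # Step 7: continue after the dip while surplus remains in the last row.
        if s[-1] <= 0:
            break
        start = l + 1
    return s


raw = [-1, 1, 3, 2, 3]                          # surplus vector of Fig. 2
normalized = normalize_surplus(raw)
print(normalized)
print(sum(abs(x) for x in raw), sum(abs(x) for x in normalized))  # shifts before/after
```

On the Fig. 2 vector the sketch reproduces the normalized surplus $[-1, 0, 1, 0, 0]^T$ and the shift counts 10 and 2 quoted above.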

Fault Shifting: Each unavailable cell in row $i$ of a physical FTPA
can be 1) shifted down to one of its supporters if $s_i < 0$, or 2) shifted
up to one of its supporters in row $j$ ($j < i$) if $s_j > 0$. After each shift,
the corresponding entry in the surplus vector is readjusted toward
zero; i.e., one is added to $s_i$ in case 1), and one is subtracted from
$s_j$ in case 2). The shifting proceeds until the surplus vector becomes
null; its effect can be described as cell migration from regions having
the most faulty cells to regions having fewer faulty cells. Fault shifting can
be summarized as follows.

While $s \ne 0$, shift unavailable cells as follows:

1) For $i = 1, 2, \ldots, M - 1$, if $s_i < 0$ and $(i, j)$ is unavailable,
$1 \le j \le N + C$, then shift $(i, j)$ to one of its supporters (if possible)
in $\Lambda_l$; then increment $s_i$. (Shifting rules must be defined here.
For instance, the selection of a good supporter could consider the
closest supporter first, etc.)

2) For $i = M - 1, M - 2, \ldots, 1$, if $s_i > 0$ and $(k, j)$ is unavailable,
$i < k \le M$, $1 \le j \le N + C$, shift $(k, j)$ to one of its supporters (if
possible) in $\Lambda_u$ and in row $i$; then decrement $s_i$. (Again priority
rules must be observed here.)

Fault shifting generates a status matrix for the logical FTPA. Each
entry in this matrix is a status value which identifies whether the
corresponding cell is fault-free, faulty, or "shifted." This value can
also reveal the replacing cell location. A concrete example will be
discussed in Section III.

Cell Interconnection: The cell interconnection constructs the
logical FTPA represented by the status matrix formed by "fault
shifting." Fault-free cells are systematically interconnected. This
implementation could be incorporated into the "fault shifting" part of the
algorithm as has been done in chain-replacement schemes [4], [5].
Instead, in FUSS, fault shifting and cell interconnection are done in
separate steps to facilitate the analysis of the effect of a restricted
interconnection network on the algorithm. Cell interconnection will
be illustrated in Section III.

D.  Interconnection  Requirements 

This subsection shows the minimum size of a support-domain $\Lambda$
required by the algorithm and the resulting maximum physical distance
(thus, the maximum signal delay) between two adjacent logical
cells. These requirements are for the ideal case where 100% probability
of survival is guaranteed. As illustrated by simple FUSS (in
Section III), survivability is not noticeably worse if interconnect is
more limited and interprocessor delays are smaller.

Theorem 1: For an $M \times (N + C)$ physical FTPA with $MC$ faulty
cells ($\rho = 1$), the minimum support radius $\delta$ of each support-domain
required by FUSS in order to achieve a 100% survivability rate ($\sigma = 1$)
is given by

$$\delta = \left\lceil \frac{\sqrt{(2C + 1)^2 + 8(M - 1)C} - (2C + 1)}{2} \right\rceil. \tag{2.4}$$


Proof: The worst fault distribution that causes FUSS to require
the largest support-domain is a cluster (of faulty cells) that spans one
corner of the physical array as shown in Fig. 4. In this figure, there
are $C + 1$ faulty cells in the first row. This means that one cell
needs a replacement from its lower support-domain $\Lambda_l$. Let a block
$(UVZXY)$ underneath the $C + 1$ faulty cells contain the total possible
number of additional faulty cells but such that the reconfiguration can
still succeed. The maximum tolerable number of faulty cells in the
FTPA is $F = S = MC$. Since there are $C + 1$ faulty cells on the top
row, the number of faulty cells in $(UVZXY)$ is $MC - (C + 1)$.
Furthermore, the $(UVZXY)$ block size (= total number of cells in
the block) must be at least one cell larger than $MC - (C + 1)$ so
that there is at least one fault-free cell inside the block available for
replacing a faulty cell in the top row. This block dictates the size of
$\Lambda_l$ required to support the algorithm. Its area is

$$(UZXY) = MC - (C + 1) + 1 = (M - 1)C. \tag{2.5}$$

But

$$(UZXY) = (UVXY) + (VXZ) = (C + 1)\delta + (\delta - 1) + (\delta - 2) + \cdots + 1 = (C + 1)\delta + \frac{\delta(\delta - 1)}{2}. \tag{2.6}$$

Setting (2.5) equal to (2.6), solving for $\delta$, and taking the greatest
integer value, (2.4) results. □

Fig. 4. Worst case block of faults.

Fig. 5. Distance between two adjacent logical cells. (a) Horizontal logical
neighbors. (b) Vertical logical neighbors.
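The algebra behind the last step of the proof can be written out as follows. This is a sketch of the omitted derivation; it reads "taking the greatest integer value" as rounding up to the next integer, which is an assumption made here so that the block is at least as large as (2.5) requires.

```latex
\begin{align*}
(C+1)\delta + \frac{\delta(\delta-1)}{2} &= (M-1)C
  && \text{(2.5) set equal to (2.6)} \\
\delta^{2} + (2C+1)\delta - 2(M-1)C &= 0
  && \text{multiply by 2 and collect terms} \\
\delta &= \frac{\sqrt{(2C+1)^{2} + 8(M-1)C} - (2C+1)}{2}
  && \text{positive root}
\end{align*}
```

For instance, with $M = 7$ and $C = 1$ the positive root is $(\sqrt{57} - 3)/2 \approx 2.27$, so a support radius of 3 is needed under this reading: a radius of 2 covers only $2 \cdot 2 + 1 = 5$ cells, fewer than the $(M - 1)C = 6$ required.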

Theorem 2: If the support radius of the support-domain used in
FUSS is $\delta$, then the maximum possible distance between two adjacent
logical cells is

$$d([i, j], [i, j + 1]) = d([i, j], [i + 1, j]) = 2\delta + C + 1. \tag{2.7}$$

Proof: Fig. 5(a) and Fig. 5(b) show the worst possible cases
that result in largest distances between two adjacent logical cells. In
these figures, the blocks shown inside each array (rectangle) contain
cells that are faulty with the exception of cells marked with $x'$ and
$y'$ which are used to replace cells $x$ and $y$, respectively. In the worst
case, cell $x'$ is at the outer left edge of the upper block shown, and
cell $y'$ is at the outer right edge of the lower block. In both figures,
$x'$ represents $[i, j]$; in Fig. 5(a), $y' = [i, j + 1]$, and in Fig. 5(b),
$y' = [i + 1, j]$. As shown in Fig. 5(a), $d(x', y')$ is the sum of the
distances between $x'$ and $x$ ($= \delta$), $x$ and $y$ ($= C + 1$), and $y$ and $y'$
($= \delta$). Therefore, $d([i, j], [i, j + 1]) = 2\delta + C + 1$. The expression
for $d([i, j], [i + 1, j])$ is obtained in a similar way. □

III. Simple FUSS

Simple FUSS is an instance of FUSS in which each support-domain
is restricted to one cell ($\delta = 1$) to minimize the maximum distance
between two adjacent logical cells. In this case, shifting is either
straight up or straight down along a column. From here on, the terms
"simple FUSS" and "FUSS" are used interchangeably; simple FUSS
with $C = 1$ is referred to as "FUSS-1," that with $C = 2$ "FUSS-2,"
and so on. The general FUSS scheme discussed in the last section is
referred to as "general FUSS."


(a)
         1  2  3  4  5  6  |   s
    1    0  0  0  0  0  0  |   1
    2    0  0  0  0  0  0  |   2
    3    0  0  1  1  0  0  |   1
    4    0  1  1  1  0  0  |  -1
    5    0  1  1  0  0  0  |  -2
    6    0  0  0  0  0  0  |  -1
    7    0  0  0  0  0  0  |   0

(b)
         1  2  3  4  5  6  |   s
    1    0  2  0  0  0  0  |   0
    2    0  2  2  0  0  0  |   0
    3    0  2  1  1  0  0  |   0
    4    0  1  1  1  0  0  |   0
    5    0  1  1  3  0  0  |   0
    6    0  3  3  0  0  0  |   0
    7    0  3  0  0  0  0  |   0

(c)
    1,1  2,2  1,2  1,3  1,4  1,5
    2,1  3,2  3,3  2,3  2,4  2,5
    3,1  4,2  0,0  0,0  3,4  3,5
    4,1  0,0  0,0  0,0  4,4  4,5
    5,1  0,0  0,0  4,3  5,4  5,5
    6,1  5,2  5,3  6,3  6,4  6,5
    7,1  6,2  7,2  7,3  7,4  7,5

Fig. 6. An example of FUSS-1. (a) Initial status matrix and surplus vector
s. (b) Final status matrix B (s = 0). (c) Final reconfigured FTPA
showing logical coordinates.


A.  Simple  FUSS  Algorithm 


The algorithm consists of four basic procedures, each of which
describes a corresponding part in the general FUSS algorithm described
in Section II. Procedures corresponding to array preprocessing and
surplus normalization are as described in Section II. In fault shifting,
cell $(i, j)$ can only replace either cell $(i - 1, j)$ or cell $(i + 1, j)$.
Fault shifting constructs status matrix $B$ as follows:

1) Initialize $B$: $b_{ij} = 0$ if $(i, j)$ is fault-free, and $b_{ij} = 1$ if $(i, j)$ is faulty.

2) Assign status to "shifted-cells": $b_{ij} = 2$ if cell $(i + 1, j)$ is
shifted to cell $(i, j)$, and $b_{ij} = 3$ if cell $(i - 1, j)$ is shifted to
cell $(i, j)$.

Matrix $B$ is formed one column at a time. The order in which the
columns are considered may affect the shift capability. It is therefore
conceivable that these columns can be scanned in an order that
maximizes FUSS probability of survival, but this remains an open
question at this time. For simple FUSS, the order is from columns
1 to $N + C$.

Cell interconnection is applied to matrix $B$. Since the status of
each cell is known, it is easy to automatically derive how cells are
interconnected. The FUSS-C algorithm is shown in detail in [7]. Each
of the four procedures in the general FUSS algorithm takes at most
$O(M \times (N + C))$ operations. The time complexity of FUSS is then
$O(MN)$, i.e., proportional to the number of processors in the array.

The following example illustrates how FUSS-1 reconfigures a
$7 \times (5 + 1)$ array. Fig. 6(a) represents the initial status matrix of the
FTPA augmented by the normalized surplus vector; row and column
indexes are added for convenience. Fault shifting scans $s$ downward
for negative surplus values. Since $s_4 = -1$, one cell is shifted down;
the only choice in this case is (4, 4). As a result $b_{5,4}$ becomes 3, to
denote that (5, 4) replaces a cell from above, namely (4, 4). At the
next row, $s_5 = -2$, which means that two unavailable cells must be
shifted down; cells (5, 2) and (5, 3) are chosen. Similarly, $s_6 = -1$
and (7, 2) replaces (6, 2). During the upward scan, the first positive
surplus encountered is $s_3 = 1$. Thus, one cell from the row below
must be shifted upward to row 3. Cell (4, 2) is the only choice; as
a result, $b_{3,2} = 2$ to denote that cell (3, 2) replaces a cell from
below. Similar actions take place for $s_2$ and $s_1$. Fig. 6(b) shows
the final status matrix after a successful cell shifting ($s$ is null). The
final reconfigured FTPA is shown in Fig. 6(c). This example depicts
the worst case of fault distribution in a physical array that results in
longest interconnect links in a logical array.
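The example can be replayed with a short Python sketch. It is an illustration under stated assumptions, not the FUSS-C algorithm of [7]: it follows the row-by-row narration above, picks the leftmost eligible column (a first-fit rule assumed here), never shifts up a cell that has already been replaced from below, and assigns logical columns by ordering the cells that serve each logical row by physical column, as Fig. 6(c) suggests. All function and constant names are hypothetical.

```python
FREE, FAULTY, REPLACES_BELOW, REPLACES_ABOVE = 0, 1, 2, 3  # status codes of Section III-A

def simple_fuss_shift(B, s):
    """Vertical-only fault shifting; returns the final status matrix or None on failure."""
    B, s = [row[:] for row in B], list(s)
    M, W = len(B), len(B[0])
    for i in range(M - 1):                        # downward scan: deficits (s_i < 0)
        while s[i] < 0:
            j = next((j for j in range(W)
                      if B[i][j] != FREE and B[i + 1][j] == FREE), None)
            if j is None:
                return None
            B[i + 1][j] = REPLACES_ABOVE          # (i+1, j) takes over for (i, j)
            s[i] += 1
    for i in range(M - 2, -1, -1):                # upward scan: surpluses (s_i > 0)
        while s[i] > 0:
            j = next((j for j in range(W)
                      if B[i][j] == FREE and B[i + 1][j] != FREE
                      and not (i + 2 < M and B[i + 2][j] == REPLACES_ABOVE)), None)
            if j is None:
                return None
            B[i][j] = REPLACES_BELOW              # (i, j) takes over for (i+1, j)
            s[i] -= 1
    return B if not any(s) else None

def logical_coordinates(B, N):
    """Map 1-based logical [row, col] to 1-based physical (row, col)."""
    row_shift = {FREE: 0, REPLACES_BELOW: 1, REPLACES_ABOVE: -1}
    members = {r: [] for r in range(len(B))}      # logical row -> [(phys col, phys row)]
    for i, row in enumerate(B):
        for j, status in enumerate(row):
            if status in row_shift:               # faulty cells serve no logical row
                members[i + row_shift[status]].append((j, i))
    phi = {}
    for r, cells in members.items():
        for c, (j, i) in enumerate(sorted(cells)[:N]):
            phi[(r + 1, c + 1)] = (i + 1, j + 1)
    return phi

B_FIG6A = [[0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0], [0, 0, 1, 1, 0, 0],
           [0, 1, 1, 1, 0, 0], [0, 1, 1, 0, 0, 0], [0, 0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0, 0]]
S_FIG6A = [1, 2, 1, -1, -2, -1, 0]

B_final = simple_fuss_shift(B_FIG6A, S_FIG6A)
phi = logical_coordinates(B_final, N=5)
for row in B_final:
    print(row)
print(phi[(4, 2)], phi[(4, 3)])
```

Under these assumptions the sketch reproduces the status matrix of Fig. 6(b) and maps logical cells [4, 2] and [4, 3] to physical cells (3, 2) and (5, 4), the pair used in the distance discussion of Section III-B.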


B. Cell Interconnect Structure

From Theorem 2 with $\delta = 1$, the maximum distance between two adjacent
logical cells is $d([i, j], [i, j + 1]) = d([i, j], [i + 1, j]) = C + 3$.
Fig. 6(c) shows that the distance between [4, 2] and [4, 3] is 4
($\phi([4, 2]) = (3, 2)$ and $\phi([4, 3]) = (5, 4)$). The distance between
[3, 3] and [4, 3] is also 4. In general, the interconnect requirements
for FUSS can be found by studying the connection windows required
by the algorithm. The size of each window (number of cells in the
window) is as stated in the following theorem.

Fig. 7. FUSS-1 physical connection windows. (a) $W_H$. (b) $W_V$.

Theorem 3: In FUSS, the size of horizontal connection window
$|W_H|$ and the size of vertical connection window $|W_V|$ of a cell are
given by, respectively,

$$|W_H| = 5(C + 1) \tag{3.1}$$

$$|W_V| = 4(2C + 1) - 1. \tag{3.2}$$

Proof: From cell shifting, $\Phi_1([i, j]) = k$, $i - 1 \le k \le i + 1$
($[i, j]$ is on physical row $k$), and $\Phi_1([i, j + 1]) = l$, $i - 1 \le l \le i + 1$.
This means that the physical location of $[i, j + 1]$ is at most two
physical rows above or two rows below the physical location of $[i, j]$.
Therefore, relative to $\Phi([i, j])$, $[i, j + 1]$ is within a band consisting
of five physical rows as shown in Fig. 7(a) (for $C = 1$). Since there
are $C$ spare columns, the link from $[i, j]$ to $[i, j + 1]$ can skip over
at most $C$ cells. Hence, $|W_H| = 5(C + 1)$.

Without loss of generality, let $\Phi([i, j]) = (i, j)$. Consider the
physical location of the vertical neighbor $[i + 1, j]$. Since physical
cells may be shifted up from the immediate row below, $[i + 1, j]$
can be on physical row $i$; i.e., $\Phi_1([i, j]) = \Phi_1([i + 1, j]) = i$. Also,
since cells may be shifted down, the physical location of $[i + 1, j]$
may be three physical rows below that of $[i, j]$ ($\Phi_1([i, j]) = \Phi_1([i + 1, j]) - 3$).
Furthermore, $[i + 1, j]$ can be $C$ cells to the left (right)
of $[i, j]$. The window therefore spans four physical rows and $2C + 1$
columns, excluding the position of the cell itself, so the vertical
window is readily seen to have the size $|W_V| = 4(2C + 1) - 1$. □
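
As a quick check of Theorem 3, the following minimal sketch evaluates both window sizes for a given number of spare columns C. The function name `connection_window_sizes` is a hypothetical helper introduced only for illustration; the two formulas are (3.1) and (3.2).

```python
def connection_window_sizes(spare_columns):
    """Return (|W_H|, |W_V|) for FUSS with C spare columns (Theorem 3).

    Hypothetical helper; only the two formulas come from the text.
    """
    if spare_columns < 1:
        raise ValueError("FUSS-C is assumed to have at least one spare column")
    c = spare_columns
    horizontal = 5 * (c + 1)        # five physical rows, C + 1 columns   (3.1)
    vertical = 4 * (2 * c + 1) - 1  # four rows, 2C + 1 columns, minus self (3.2)
    return horizontal, vertical


if __name__ == "__main__":
    for c in range(1, 6):
        w_h, w_v = connection_window_sizes(c)
        print(f"FUSS-{c}: |W_H| = {w_h}, |W_V| = {w_v}")
```

For C = 1 this reproduces the FUSS-1 window sizes of 10 and 11 cells.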

For FUSS-1, $|W_H|$ and $|W_V|$ are 10 and 11, respectively, as shown
in Fig. 7. This means that interconnect links must be provided for
each cell so that it can communicate with any one of 10 cells
horizontally and any one of 11 cells vertically. These are not necessarily
dedicated links; they can be shared buses, virtual channels, etc. In
this paper a switch-bus similar to that used in [5]-[7] is considered.
In general, the number of buses and switches required by a logical
row is constant and that required by a logical column is a linear
function of the number of spare columns. FUSS-C requires an
$(\alpha + 1) \times 3$ switch bus as shown in Fig. 8(a), where
$\alpha = \lceil 1.5C \rceil$. (Thus, FUSS-1 requires a 3 × 3 switch bus.)
The following corollaries give the size of the interconnection.

IEEE TRANSACTIONS ON COMPUTERS, VOL. 39, NO. 4, APRIL 1990

Fig. 8. Switch-bus structure. (a) FUSS-C. (b) Switch settings.

Fig. 9. Source-sink on same row; source at left.

Fig. 10. Source-sink on same row; source at right.

Corollary 1: In addition to cell-to-cell links, the horizontal path
of FUSS consists of one horizontal bus per row, two vertical buses
per column, and four switches per cell of the physical FTPA.

Corollary 2: Let $\alpha = \lceil 1.5C \rceil$. In addition to cell-to-cell links,
the vertical path of FUSS consists of one vertical bus per column,
$\alpha$ horizontal buses per row, and $2\alpha$ switches per cell of the physical
FTPA.
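
The two corollaries can be tabulated with a short sketch. It assumes the garbled definition reads α = ⌈1.5C⌉, which is consistent with the statement that FUSS-1 needs a 3 × 3 switch bus. The function and dictionary key names are hypothetical and chosen only for readability.

```python
def alpha(spare_columns):
    """ceil(1.5 * C), computed with integer arithmetic to avoid float rounding."""
    return (3 * spare_columns + 1) // 2


def fuss_interconnect(spare_columns):
    """Per-row, per-column and per-cell resources from Corollaries 1 and 2.

    Hypothetical helper: the counts are taken from the corollaries,
    the layout of the returned dictionary is an assumption.
    """
    a = alpha(spare_columns)
    return {
        "switch_bus_size": (a + 1, 3),
        "horizontal_path": {
            "horizontal_buses_per_row": 1,
            "vertical_buses_per_column": 2,
            "switches_per_cell": 4,
        },
        "vertical_path": {
            "vertical_buses_per_column": 1,
            "horizontal_buses_per_row": a,
            "switches_per_cell": 2 * a,
        },
    }


if __name__ == "__main__":
    for c in range(1, 6):
        print(c, fuss_interconnect(c))
```

The horizontal-path cost is independent of C, while the vertical-path cost grows linearly with the number of spare columns, as stated in the text.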

Fig. 8 shows a switch-bus structure required by FUSS. With this
structure, there are many possible paths between any pair of adjacent
logical cells. Therefore, a shortest path must be devised so that
the interconnect delay is minimized. Due to space limitations, the
bypass interconnect rules are only summarized here (details are
given in [7]). Bus labels $H_i$, $i = 0, 1, 2$, and $V_i$, $i = 0, 1, \ldots, \alpha$, are
as shown in Fig. 8. Let $\Phi([m, n]) = (i, j)$, $\Phi([m, n + 1]) = (k, l)$,
and $\Phi([m + 1, n]) = (p, q)$; $i - 2 \le k \le i + 2$, $j + 1 \le l \le j + C + 1$,
$i \le p \le i + 3$, and $j - C \le q \le j + C$. Cell $[m, n]$ is called source
(denoted by $s$) and cells $[k, l]$ and $[p, q]$ are horizontal sink and
vertical sink, respectively (denoted by $t$).

Connecting (i, j) to (k, l):

• use $H_0(i)$ to go rightward;

• use $H_1(j + 1)$ and/or $H_1(l)$ to go downward;

• use $H_2(j)$ and/or $H_2(l - 1)$ to go upward.

Connecting (i, j) to (p, q):

Let $\delta = \lfloor 0.5\alpha \rfloor$, where $\alpha = \lceil 1.5C \rceil$.

1) If $i < p$ (Fig. 11), then

• use the first available $V_x(i)$, $x = 2, 3, \ldots, \delta + 1$, to go right
or left toward (p, q);

• use $V_0(q)$ to go downward;

• use $V_1(p)$ to reach (p, q).

2) If $i = p$, then

a) If $j < q$ (source on the left; Fig. 9):

i) use the first available bus from the ordered set
$\{V_1(i), V_\alpha(i - 1), V_{\alpha-1}(i - 1), \ldots, V_{\delta+1}(i - 1)\}$ to go
right;

ii) if all buses in case i) are exhausted, use the first available
bus from the ordered set $\{V_2(i), V_3(i), \ldots, V_{\delta+1}(i)\}$ to
go right;

iii) use $V_0(j)$ (for case i), $V_0(q)$ for case ii)) to go upward.

b) If $q < j$ (source to the right of sink; Fig. 10):

i) use the first available bus from the ordered set
$\{V_2(i), V_3(i), \ldots, V_{\delta+1}(i)\}$ to go left;

ii) if all buses in case i) are exhausted, use the first available
bus from the ordered set $\{V_1(i), V_\alpha(i - 1), V_{\alpha-1}(i - 1), \ldots, V_{\delta+2}(i - 1)\}$
to go left;

iii) use $V_0(q)$ (for case i), $V_0(j)$ for case ii)) to go upward.
Illustrations of interconnection between $(i, j) = s$ and $(p, q) = t$
are shown in Figs. 9-11 (common subscript is used for a
source-sink pair). Bus $V_0$ is exclusively assigned to its corresponding
cell. It directs the signal downward for case 1) and upward for
case 2). This exclusive assignment of $V_0$ may result in extra wire
length (two units). For instance, in Fig. 9, $V_0$ on the left of $t_c$ (and
$V_0$ for some other sinks) may be used. However, this can make the
rules more complex.

C. FTPA Structure and Area Overhead

The FTPA control mechanisms required to support reconfiguration
can be implemented in different ways. Since a scheme
can be designed to be either “centralized” (executed by a central
computer) or “distributed,” the complexity of the control hardware
varies accordingly. The interconnect redundancy depends on whether
multiplexers, virtual channels, switched buses or other approaches
are used. FUSS-C uses the switch-bus structure shown in Fig. 8 and
a typical FTPA is shown in Fig. 12, for FUSS-1. In Fig. 12, rectangles
represent I/O ports, large boxes represent cells, small boxes
represent switches, dashed lines are buses for horizontal paths, and
solid lines are buses for vertical paths. Other reconfiguration circuitry
(fault and surplus counters) is not shown in Fig. 12 but can be found
in [7] where area complexity is also computed.

Fig. 11. Source-sink on different rows.

Fig. 12. FUSS-1 FTPA.

D.  FUSS  Performance 

FUSS  has  been  evaluated  with  respect  to  its  probability  of  survival 
and  its  reliability.  The  former  was  achieved  by  extensive  simulation 
and  the  latter  was  estimated  by  the  MGRE  (model  generator  and 
reliability  evaluator)  software  tool  [9]  (details  shown  in  [7]). 

Arrays of various sizes (5 × 5 to 40 × 40) and with varying numbers
of spare columns (C = 1 to C = 5) have been simulated. Faults
were randomly injected into the array cells. In each case, probabilities
of survival are obtained for 100 000 arrays having demand-spare
ratio $0.3 \le \rho \le 1.0$. When $\rho \le .5$, the survivability rate is 100%; it is more
than 93% when $\rho = 1$. For the purpose of comparison, probabilities
of survival of a number of schemes are shown, along with the
complexity of each algorithm, in Table I for M = N = 20.

FUSS survivability rate can be further improved by slightly modifying
the cell-shifting procedure. In the new cell-shifting approach,
cell (i, j) is shifted to cell (i ± 1, j), not only when cell (i ± 1, j)
is available, but also (if possible) when cell (i ± 2, j) is available.


TABLE I

Probability of Survival of Selected Reconfiguration Schemes (Partially
from [3], [4])

Reconfiguration scheme         Algorithm          Probability of survival
                               complexity         ρ = .50   ρ = .75   ρ = .90
Direct Reconfiguration [10]    O(N)               0.43      0.35      0.27
Fixed-Fault-Stealing           O(N)               0.47      0.38      0.30
Complex-Fault-Stealing         O(N)               0.90      0.84      0.77
Modified Index-Mapping         O(N^2)             0.67      0.62      0.54
MORA                           O(N log N)         0.90      0.84      0.78
Spanning Tree Based            O(N log log N)     0.93      0.89      0.78
Orthogonal Mapping             O(N^2)             0.93      0.88      0.82
Simple FUSS                    O(N^2)             1.00      1.00      0.99

TABLE II

Array Survivability at Maximum Number of Faults

Probability of survival (%) of FUSS with improved cell-shifting;
number of faults = number of spares (ρ = 1)

       Array size (M × N)
C      5 × 5      10 × 10    20 × 20    30 × 30    40 × 40
1      99.2567    98.7047    98.7383    98.9145    99.0524
2      98.7698    98.6485    99.1754    99.4904    99.6391
3      98.4911    98.6554    99.4346    99.7239    99.8534
4      98.3180    98.6508    99.5387    99.8301    99.9214
5      98.1736    98.8242    99.8006    99.8670    99.9509

This provides a one-cell “look-ahead” capability in anticipating and
facilitating further shifting. Simulation results show much improved
probability of survival as summarized in Table II. Here, only the
worst case ($\rho = 1$, number of faults equal to number of spares) is
shown. Each case involves 1 000 000 simulated arrays.

IV.  Concluding  Remarks 

A new approach to hardware reconfiguration for fault-tolerant
processor arrays has been described. The technique, called FUSS (full
use of suitable spares), is based on global information about faulty
cells in the array. That is, the number of faulty cells and their
distribution in the array are compiled to form a surplus vector which
is used to guide cell replacement. The general FUSS can achieve
100% probability of survival when unlimited interconnect links are
provided. These interconnection requirements have been derived and
shown to be proportional to the size of the array.

An instance of FUSS, called simple FUSS, has been presented
in detail. Simulations show that simple FUSS achieves up to 99%
probability of survival when the number of faulty cells equals the
number of spare cells in a given array. Simple FUSS executes in
O(MN) time, for an M × (N + C) array.

Recently, Youn and Singh [11] have independently proposed an
approach, called row modular scheme, similar to simple FUSS (the
FUSS approach was first reported in [12]). The approach is an
instance of general FUSS and is implemented for yield enhancement
in WSI/VLSI processor arrays.

The FUSS approach can be readily applied to other mesh-like
structures, such as programmable logic arrays, memory arrays, etc.
While FUSS has been designed for two-dimensional arrays, it may
also be extended to other architectures such as modified meshes
(toroidal meshes, meshes with broadcast buses, etc.), trees,
hypercubes, multistage networks, and others. The use of FUSS in these
and other useful topologies needs further study and evaluation [7].

References 

[1] M. Chean and J. A. B. Fortes, “A taxonomy for reconfiguration
techniques for fault-tolerant processor arrays,” IEEE Computer Mag.,
Jan. 1989.

[2] R. Negrini and R. Stefanelli, “Comparative evaluation of space- and
time-redundancy approaches for WSI processing arrays,” in Wafer
Scale Integration, G. Saucier and J. Trilhe, Eds. New York:
Elsevier Science, 1986, pp. 207-222.

[3] L. Jervis, F. Lombardi, and D. Sciuto, “Orthogonal mapping: A
reconfiguration strategy for fault tolerant VLSI/WSI 2-D arrays,” in
Proc. Int. Workshop Defect Tolerance VLSI Syst., Oct. 1988.

[4] F. Lombardi, D. Sciuto, and R. Stefanelli, “A technique for
reconfiguring two dimensional VLSI arrays,” in Proc. Real-Time Syst.
Symp., 1987, pp. 44-33.

[5] F. Lombardi, R. Negrini, and R. Stefanelli, “Reconfiguration of VLSI
arrays: A covering approach,” in Proc. 17th Int. Symp.
Fault-Tolerant Comput. Syst., 1987, pp. 251-256.

[6] M. G. Sami and R. Stefanelli, “Fault-tolerance and functional
reconfiguration in VLSI arrays,” in Proc. Int. Symp. Circuits Syst., 1986,
pp. 643-648.

[7] M. Chean, “Hardware reconfiguration for fault-tolerant processor
arrays,” Ph.D. dissertation, School of Elec. Eng., Purdue Univ., 1989.

[8] L. Snyder, “Introduction to the configurable highly parallel
computers,” IEEE Comput. Mag., vol. 15, pp. 47-55, Jan. 1982.

[9] N. Lopez-Benitez and J. A. B. Fortes, “Detailed modeling of
fault-tolerant processor arrays,” in Proc. 19th Int. Symp. Fault-Tolerant
Comput. Syst., June 1989, pp. 545-552.

[10] M. G. Sami and R. Stefanelli, “Reconfigurable architectures for VLSI
processing arrays,” Proc. IEEE, vol. 74, pp. 712-722, May 1986.

[11] H. Y. Youn and A. D. Singh, “Efficient reconfiguration of WSI arrays
with bounded channel width,” in Proc. Int. Workshop Hardware
Fault Tolerance Multiprocessors, June 1989, pp. 24-26.

[12] M. Chean and J. A. B. Fortes, “FUSS: A reconfiguration scheme for
fault-tolerant processor arrays,” in Proc. Int. Workshop Hardware
Fault Tolerance Multiprocessors, June 1989, pp. 30-32.


On  Reliability  Modeling  of  Closed  Fault-Tolerant  Computer 
Systems 

MEERA  BALAKRISHNAN  and  C.  S.  RAGHAVENDRA 

Abstract: Markov modeling techniques have proved useful for
reliability prediction of complex fault-tolerant computers. In [7], Ng and
Avizienis proposed a general model for evaluating fault-tolerant
computer systems (a continuous-time Markov model referred to as the
ARIES model) and suggested the Lagrange-Sylvester interpolation
formula as a unified solution technique to this model. This is a valid
solution approach if the eigenvalues of the state transition rate matrix
associated with the Markov chain are distinct or, if eigenvalues repeat,
the rate matrix is diagonalizable. In [2], Geist and Trivedi have
presented an example of a closed (i.e., nonrepairable) fault-tolerant system
modeled by ARIES which has repeated eigenvalues; in this paper, we
observe that, in fact, a large number of closed fault-tolerant systems
modeled by ARIES have repeated eigenvalues. The Lagrange-Sylvester
formula is an acceptable solution approach only if it can be guaranteed
that the rate matrix is diagonalizable.

In this paper, we prove that the rate matrix representing the system is
diagonalizable for every closed fault-tolerant system modeled by ARIES.
Consequently, the Lagrange-Sylvester interpolation formula is applicable
to all closed fault-tolerant systems which ARIES models. Also, since our
proof guarantees that the rate matrix is diagonalizable, general methods
for solving arbitrary Markov chains can be tailored to solve the ARIES
model for closed systems efficiently. For example, the ACE algorithm
for the transient solution of acyclic Markov chains [4] can be specialized
to solve the ARIES model for closed systems efficiently.

Index Terms: Closed fault-tolerant systems, Laplace transform,
Markov modeling, reliability prediction.

Manuscript received July 3, 1989; revised November 6, 1989. M.
Balakrishnan was supported by a Zonta International Amelia Earhart Fellowship
Award. C. S. Raghavendra was supported by the NSF Grant MIP84520G3,
a grant from AT&T, and a grant from TRW.

The authors are with the Department of Electrical Engineering-Systems,
University of Southern California, Los Angeles, CA 90089.

IEEE Log Number 8933878.


I.  Introduction 

A major application of closed (nonrepairable) fault-tolerant
computer systems is in the aerospace industry. Indeed, it was the space
program in the early years that motivated research in the area of
fault-tolerant computers. In [2] the term ultrahigh-reliability has
been used to indicate the stringent reliability requirements demanded
by these nonrepairable computer systems. With such demands on the
reliability measure, mathematical modeling is often the only means of
assessing and comparing the reliability of two or more ultra-reliable
systems.

Several mathematical models for reliability prediction have been
developed and are available as software packages ([1], [3], [9]).
These packages accept a high-level description of a fault-tolerant
system and internally generate an equivalent mathematical model of the
system. Various dependability measures are then derived by solving
the mathematical model. For a review and critical evaluation
of several of these models, we refer the reader to [2] and [5]. The
ARIES model proposed by Ng and Avizienis [7] is a unified reliability
model developed for evaluating fault-tolerant computer systems.
Based on the Markov modeling technique, it unifies most reliability
models proposed in the 1970's. Further, the solution for the reliability
function R(t) is an analytical expression which is in the form of
a weighted sum of pure negative exponentials. Hence, the solution is
completely specified by a set of parameter pairs (parameters of the
exponentials and the corresponding multipliers) which depend on the
fault-tolerant system under evaluation. However, the solution is of
this form only if the state transition rate matrix of the Markov chain
is diagonalizable.

Let $A$ denote the transition rate matrix. The solution approach
in ARIES is to compute the matrix function $e^{At}$ using the
Lagrange-Sylvester interpolation formula. The state probability vector
is then given by $P(t) = e^{At}P(0)$ and the reliability function by
$R(t) = \sum_i P_i(t)$, where the sum is taken over the operational states.
From standard matrix theory, if the eigenvalues of $A$ are distinct,
then $A$ is diagonalizable so that the
minimal polynomial of $A$ has distinct roots. For this case, the
Lagrange-Sylvester formula¹ is a valid solution technique. If the
eigenvalues are not distinct, however, $A$ may or may not be
diagonalizable. For this case of repeated eigenvalues, ARIES considers
only the distinct eigenvalues (ignoring duplicates) and solves as in
the previous case. Since this is a valid approach only when $A$ is
diagonalizable, it is not clear that the ARIES results are correct for
all systems with repeated eigenvalues. In [2], Geist and Trivedi have
given an example of an ARIES closed fault-tolerant system in which
the eigenvalues of the transition rate matrix are not distinct. Further,
they have observed that the set of eigenvectors of the transition rate
matrix are independent (i.e., $A$ is diagonalizable) and that classical
solution methods yield correct results. In this paper we prove that
for ARIES closed systems, $A$, the transition rate matrix, is always
diagonalizable so that the Lagrange-Sylvester formula is applicable
despite repeated eigenvalues. Further, since our proof guarantees that
the rate matrix is always diagonalizable, general methods for solving
arbitrary Markov chains can be tailored to solve the ARIES model
efficiently.
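
The role of diagonalizability can be illustrated with a small numerical sketch. It uses the standard Lagrange-Sylvester interpolation formula over the distinct eigenvalues only, as the text describes for ARIES, and compares it with a truncated Taylor series of the matrix exponential. The two 3 × 3 rate matrices are hypothetical examples chosen for illustration: both have eigenvalues -1, -1, 0, but only the first is diagonalizable. Neither is taken from the paper, and the second is a generic Markov chain rather than an ARIES model.

```python
import math


def identity(n):
    return [[1.0 if r == c else 0.0 for c in range(n)] for r in range(n)]


def mat_mul(x, y):
    return [[sum(x[r][k] * y[k][c] for k in range(len(y)))
             for c in range(len(y[0]))] for r in range(len(x))]


def mat_add(x, y, scale_y=1.0):
    """Return x + scale_y * y."""
    return [[x[r][c] + scale_y * y[r][c] for c in range(len(x[0]))]
            for r in range(len(x))]


def mat_scale(x, s):
    return [[s * v for v in row] for row in x]


def expm_series(a, t, terms=60):
    """Reference value of exp(a*t) from a truncated Taylor series."""
    n = len(a)
    result, term = identity(n), identity(n)
    for k in range(1, terms):
        term = mat_scale(mat_mul(term, a), t / k)  # (a*t)^k / k!
        result = mat_add(result, term)
    return result


def expm_sylvester(a, distinct_eigenvalues, t):
    """Lagrange-Sylvester formula using each distinct eigenvalue once.

    Valid only when `a` is diagonalizable.
    """
    n = len(a)
    result = [[0.0] * n for _ in range(n)]
    for lam_k in distinct_eigenvalues:
        factor = identity(n)
        for lam_j in distinct_eigenvalues:
            if lam_j != lam_k:
                shifted = mat_add(a, identity(n), scale_y=-lam_j)
                factor = mat_scale(mat_mul(factor, shifted),
                                   1.0 / (lam_k - lam_j))
        result = mat_add(result, factor, scale_y=math.exp(lam_k * t))
    return result


def max_abs_diff(x, y):
    return max(abs(x[r][c] - y[r][c])
               for r in range(len(x)) for c in range(len(x[0])))


# Columns sum to zero because P(t) = exp(A t) P(0) acts on column vectors.
DIAGONALIZABLE = [[-1.0, 0.0, 0.0], [0.0, -1.0, 0.0], [1.0, 1.0, 0.0]]
DEFECTIVE = [[-1.0, 0.0, 0.0], [1.0, -1.0, 0.0], [0.0, 1.0, 0.0]]

if __name__ == "__main__":
    for name, a in (("diagonalizable", DIAGONALIZABLE), ("defective", DEFECTIVE)):
        err = max_abs_diff(expm_sylvester(a, [-1.0, 0.0], 0.7),
                           expm_series(a, 0.7))
        print(f"{name}: max |Sylvester - series| = {err:.3e}")
```

For the diagonalizable matrix the two results agree to rounding error; for the defective one they do not, which is why a guarantee of diagonalizability matters when duplicate eigenvalues are ignored.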

II. ARIES Model and Solution Technique for Closed
Fault-Tolerant Systems

We  begin  with  a  brief  description  of  the  ARIES  model  for  closed 
fault-tolerant  systems  (for  a  detailed  description,  see  [6]  and  [7]). 
The  input  to  ARIES  consists  of  a  set  of  parameters  which  completely 
specifies  the  fault-tolerant  system.  The  definition  of  these  parameters 
[6]  is  reproduced  here  for  easy  reference. 

N  =  Initial  number  of  modules  in  the  active  configuration 

S = Number of spare modules

¹ This formula, described by (2.2) in this paper, is applicable only if the
transition rate matrix is diagonalizable.


0018-9340/90/0400-0571$01.00 © 1990 IEEE


REFERENCE  NO.  8 


Lopez-Benitez, N. and Fortes, J. A. B., "Detailed Modeling and Reliability Analysis
of Fault-Tolerant Processor Arrays," IEEE Transactions on Computers, Vol. 41, No.
9, September 1992, pp. 1193-1200.

Note - This paper presents an efficient approach to the problem of modeling and
evaluating the reliability of fault-tolerant processor arrays. It also reports results
obtained by using a software package that implements the proposed approach.


IEEE TRANSACTIONS ON COMPUTERS, VOL. 41, NO. 9, SEPTEMBER 1992




Detailed  Modeling  and  Reliability  Analysis 
of  Fault-Tolerant  Processor  Arrays 

N.  Lopez-Benitez  and  J.  A.  B.  Fortes 


Abstract: A method for the generation of detailed models of
fault-tolerant processor arrays, based on Stochastic Petri Nets (SPN), is
presented in this paper. A compact SPN model of the array associates with
each transition a set of attributes that includes a discrete probability
distribution. Depending on the type of component and the reconfiguration
scheme, these probabilities are determined using simulation or
closed-form expressions and correspond to the survival of the array given that a
number of components required by the reconfiguration process are faulty.

Index Terms: Fault-tolerance, Markov models, processor arrays,
reliability, stochastic Petri nets.


I.  Introduction 

As is the case with many systems, Markov models can be used
to evaluate the reliability of processor arrays. However, reliability
estimations are mostly based on the failures of processing elements
only [1]. Components other than processing elements become very
important in the analysis of fault-tolerant processor arrays because
of their susceptibility to faults and the added hardware complexity
of the overall array. This fact has played an important role in the
derivation of the mathematical framework developed by Keren et al.
[2] to evaluate yield improvement and performance-related measures
of different array architectures. A detailed modeling of fault-tolerant
processor arrays, which explicitly takes into consideration the failure
statistics of each component as well as their possible
interdependencies, entails not only an explosive growth in the model state space
but also a difficult model construction process. This paper proposes
a systematic method to construct Markov models for evaluating the
reliability of processor arrays. The method is based on the premise
that the fault behavior of a processor array can be modeled by a
Stochastic Petri Net (SPN). However, in order to obtain a more
compact and efficient representation, a set of attributes is associated
with each transition in the Petri net model. This set of attributes
allows the construction of the corresponding Markov model as the
reachability graph is being generated. Included in these attributes is
a discrete probability distribution such that the effect of faulty spares
in the reconfiguration algorithm is captured each time a configuration
change occurs. This distribution includes the probabilities of survival
given that a number of components required by the reconfiguration
process are faulty. Depending on the type of component and the
reconfiguration scheme, these distributions are determined using
simulation or closed-form expressions. The application of this method
is illustrated by the detailed analysis of the SRE reconfiguration
scheme [3]; also, analytical expressions to estimate probabilities of
survival are derived for this scheme. Other applications that include
schemes such as ARCE (Alternate Row-Column Elimination) [3] and
DR (Direct Reconfiguration) [4] are reported in [5].

Manuscript received September 11, 1990; revised October 21, 1991. This
work was supported in part by the Innovative Science and Technology Office
of the Strategic Defense Initiative Organization and was administered through
the Office of Naval Research under Contracts N00014-85-k-0588 and
N00014-88-k-0723.

N. Lopez-Benitez is with the Department of Electrical Engineering,
Louisiana Tech University, Ruston, LA 71272.

J. A. B. Fortes is with the School of Electrical Engineering, Purdue
University, West Lafayette, IN 47907.

IEEE Log Number 9200308.

Once  the  Petri  net  model  and  the  corresponding  reachability  graph 
have  been  obtained,  all  the  information  required  to  build  the  transition 
matrix  of  the  corresponding  Markov  chain  is  available.  Reliability 
evaluation  tools  such  as  ARIES  [6]  and  SHARPE  [7]  can  be  used  to 
evaluate  the  models  developed  here. 

The second section of this paper discusses some basic notation
and concepts which include array configurations and Petri nets; also,
the large state space inherent in the models generated is illustrated.
The  third  section  discusses  MSPN  models  as  an  extension  of  SPN’s. 
Throughout  these  two  sections  the  SRE  reconfiguration  scheme  is 
used  as  an  application.  In  the  fourth  section  a  procedure  used  in 
generating  the  reachability  graph  is  described.  Finally,  results  on 
reliability  analysis  are  reported  in  Section  V. 

II.  Preliminaries 

A.  Array  Configurations 

To analyze a fault-tolerant array architecture with k types of
components, the configuration of an array is represented as a k-tuple:

$$C_i = (\eta_{i1}, \eta_{i2}, \ldots, \eta_{ik}), \qquad i = 0, 1, \ldots, |\mathcal{C}|$$

where $\eta_{il}$ denotes the number of elements of component type l and
$\mathcal{C}$ is the set of all possible configurations of the array. Examples of
component types include processing elements, links, switches, spare
links, and spare processing elements. The occurrence of faults and
the application of the reconfiguration algorithm define a sequence
of configurations that begins with $C_0$ as the initial configuration;
any other configuration can correspond to the failure state or an
operational state of the array. The latter will be referred to as an
operational configuration.

Upon detection of a faulty component, the reconfiguration
algorithm may not send the array to an operational configuration if any
of the following happens:

1) The reconfiguration circuitry failed. This possibility can be
considered through a coverage factor (denoted by c) defined
as the probability of successful reconfiguration given that a
fault has occurred [8]. This is a measure of the probability of
successful operation of all circuitry related to fault detection,
isolation, and reconfiguration. The coverage factor is assumed
constant and it will be associated with failures of active
components only.


0018-9340/92$03.00 © 1992 IEEE




2) Redundancy is exhausted. This information can be inferred
from $C_i$.

3) The presence of faults in nonactive components (redundancy)
hinders a successful reconfiguration. Redundant components
are present in $C_i$ as spare processing elements, spare switches,
spare links, spare buses, etc. Some of these components become
active in the new configuration.

In a given configuration with a number of faulty components, a
successful reconfiguration will depend not only on the type of faults
but also on their distribution in the array. Thus, the probability of a
correct reconfiguration in the presence of faults is referred to as the
probability of survival [4]. Because the reconfiguration algorithm may
choose one of several new configurations (including a nonoperational
one), a probability is assigned to each possible new configuration.
The probability of survival corresponds to the sum of probabilities
assigned to new operational configurations.
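
The last definition can be made concrete with a minimal sketch. It assumes each possible new configuration is recorded as a pair of its assigned probability and a flag saying whether it is operational. The function name and the example numbers are hypothetical and are not taken from the paper.

```python
def probability_of_survival(outcomes):
    """Sum the probabilities assigned to operational new configurations.

    `outcomes` is a sequence of (probability, is_operational) pairs that
    together form a discrete distribution (hypothetical representation).
    """
    outcomes = list(outcomes)
    total = sum(p for p, _ in outcomes)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("outcome probabilities must sum to 1")
    return sum(p for p, operational in outcomes if operational)


if __name__ == "__main__":
    # Illustrative numbers only: two operational outcomes, one failure outcome.
    example = [(0.6, True), (0.3, True), (0.1, False)]
    print(probability_of_survival(example))
```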

B.  The  Largeness  Problem 

Consider for example an n × n array that supports the
Successive-Row-Elimination (SRE) reconfiguration scheme with a layout as in
Fig. 1 (where n = 4). A row of Switches (S) and Spare Bypass Links
bypass all the Processing Elements (PE's) in any row containing at
least one faulty PE or at least one faulty horizontal link (HL) or at
least one faulty input/output link (IOL); spare bypass links (SBL's)
then become active Bypass Links (BL); the array performance
degrades gracefully as rows are eliminated. The minimum working
configuration consists of a single functional row of PE's. However,
all active bypass links and all switches must be fault-free. The array
fails if operational rows are exhausted or a single switch or an active
bypass link has failed. Assuming only PE's fail, a Markov chain
will contain n operational states [3]. For each operational state i in
the Markov chain there will be one transition to an operational state
j with a rate denoted as $\lambda_{ij}$ and one transition to the failure state with a
rate $\lambda_{if}$. If the failure rate of a PE is $\alpha_1$, then these transition rates
can be evaluated as $\lambda_{ij} = \eta_{i1}\alpha_1 c$ and $\lambda_{if} = \eta_{i1}\alpha_1(1 - c)$, where
$\eta_{i1}$ corresponds to the number of PE's active when the process is in
state i and c is a constant coverage factor.

Assume now that not only PE's fail but also switches are
susceptible to failure with a rate $\alpha_3$. Note that the failure of a single switch
will cause the array to fail; therefore, the transition rate to the failure
state is modified accordingly; i.e., $\lambda_{if} = \eta_{i1}\alpha_1(1 - c) + \eta_{i3}\alpha_3$,
where $\eta_{i3}$ corresponds to the number of switches when the process is
in state i. In addition to failures of PE's and switches, assume failures
of links including spare and active bypass links can occur. Failures of
I/O links and horizontal links can be treated as PE failures in the sense
that the same state will be generated but with a different transition
rate. Of particular interest is the failure of spare bypass links, because
failures of this type will generate a large state space. For n = 4
the number of states increases to 44 as shown in Table I. Although
it is possible to construct the model manually, the process has now
become complex and time consuming. In addition, nondistinct states
must be merged and the survivability and coverage factors must be
aggregated to estimate transition rates.

The maximum number of states $S_{SRE}(n)$ generated in terms of
the size $n$ of the array is tabulated in Table I. An expression for
$S_{SRE}(n)$ is obtained by observing that $n$ states are generated
with nonfaulty spare bypass links; for each operational state $i$,
$n(n - i)$ additional states can be generated with at least one faulty
spare bypass link. Hence,
$S_{SRE}(n) = \sum_{i=0}^{n-1} [n(n - i) + 1]$. Simplifying this
expression, the maximum number of states is given by

$$S_{SRE}(n) = \frac{1}{2}(n^3 + n^2 + 2n).$$
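
The state count can be checked numerically. The following Python sketch evaluates both the sum and its closed form and compares them with the $S_{SRE}(n)$ row of Table I. The function names are hypothetical, and the summation range $i = 0, \ldots, n-1$ is an assumption of this illustration (it is the range for which the sum agrees with the tabulated values).

```python
# Illustrative check only; names are not taken from the paper.

def s_sre_sum(n: int) -> int:
    """Sum over the number i of eliminated rows (assumed i = 0..n-1):
    n*(n - i) markings with faulty SBL's plus one marking without."""
    return sum(n * (n - i) + 1 for i in range(n))


def s_sre_closed(n: int) -> int:
    """Closed form (n^3 + n^2 + 2n) / 2; the numerator is always even."""
    return (n**3 + n**2 + 2 * n) // 2


# S_SRE(n) values as tabulated in Table I for n = 2..10.
TABLE_I_SRE = {2: 8, 3: 21, 4: 44, 5: 80, 6: 132,
               7: 203, 8: 296, 9: 414, 10: 560}

for n, tabulated in TABLE_I_SRE.items():
    assert s_sre_sum(n) == s_sre_closed(n) == tabulated
```

The cubic growth is the point of the section: even for a 10 x 10 array the detailed SRE model already has 560 states.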


Fig.  1.  Schematic  layout  of  the  SRE  structure. 

Similar expressions have been derived for the ARCE and DR schemes
[5], whose values are shown in Table I for several instances of $n$.

The complexity involved in constructing Markov chains with
detailed modeling motivates the need for a higher level representation
such as the Modified Stochastic Petri Nets proposed in this paper.
This type of modeling is useful in representing the fault behavior of
the array in the presence of different types of faults and in
generating the Markov chain for any array size and in the format
required by the chosen evaluation package.


C. Petri Nets

Petri nets can be used to analyze complex systems with interdependent
components. A formal definition of a Petri net is presented in [9]
as a 4-tuple $PN = (P, T, A, M_0)$ where
$P = \{p_1, p_2, \ldots, p_k\}$ is the set of places,
$T = \{t_1, t_2, \ldots, t_l\}$ is the set of transitions,
$A \subseteq [P \times T] \cup [T \times P]$ is the set of input and
output arcs, and $M_0$ is the initial marking. Each place contains a
number of tokens. A marking in a PN is represented by a $k$-component
vector $M = [m_1, m_2, \ldots, m_k]$ where $m_i$ is the number of
tokens in place $p_i$. The set of input places $I$ and the set of
output places $O$ are mappings defined with respect to a transition
$t_i$ as $I(t_i) = \{p \mid (p, t_i) \in A\}$ and
$O(t_i) = \{p \mid (t_i, p) \in A\}$, respectively.
The number of tokens associated with an arc is referred to as the
multiplicity of the arc. A transition $t_i$ is enabled when a PN is in
a marking $M$ such that all input places of $t_i$ contain a number of
tokens greater than or equal to the multiplicities associated with the
corresponding arcs. A transition fires if it is enabled. When a
transition fires, tokens are removed from its input places and added
to its output places according to the multiplicities of the
corresponding arcs. This determines a new marking. The reachability
set of a PN is the set of all markings reachable from $M_0$ through a
sequence of firings. The reachability graph of a PN is a directed
graph whose vertices are the elements of the reachability set and
whose arcs correspond to transition firings in the PN.
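
The enabling rule, the firing rule, and the construction of the reachability graph can be sketched in a few lines of Python. This is a minimal illustration, not code from the paper: the data layout (dictionaries mapping place indices to arc multiplicities) and the two-place example net are assumptions, and the search terminates only for nets with a finite reachability set.

```python
from collections import deque


def enabled(marking, inputs):
    """A transition is enabled if every input place holds at least as many
    tokens as the multiplicity of its input arc."""
    return all(marking[p] >= mult for p, mult in inputs.items())


def fire(marking, inputs, outputs):
    """Remove tokens from input places and add tokens to output places."""
    new = list(marking)
    for p, mult in inputs.items():
        new[p] -= mult
    for p, mult in outputs.items():
        new[p] += mult
    return tuple(new)


def reachability_graph(m0, transitions):
    """Breadth-first construction of the reachability graph.
    transitions: name -> (inputs, outputs); returns marking -> [(name, next)]."""
    graph = {}
    queue = deque([tuple(m0)])
    while queue:
        m = queue.popleft()
        if m in graph:
            continue
        graph[m] = []
        for name, (ins, outs) in transitions.items():
            if enabled(m, ins):
                nxt = fire(m, ins, outs)
                graph[m].append((name, nxt))
                queue.append(nxt)
    return graph


# Hypothetical two-place net: t1 moves one token from place 0 to place 1.
net = {"t1": ({0: 1}, {1: 1})}
graph = reachability_graph((2, 0), net)
```

For this toy net the reachability set consists of the three markings (2, 0), (1, 1), and (0, 2); the last one has no outgoing arc because `t1` is no longer enabled.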

III.  Modified  Stochastic  Petri  Nets 
A. SPN Representation

Stochastic Petri Nets (SPN's) extend the Petri Net concept by
associating random firing times with each transition. This extension
allows a one-to-one mapping between the markings generated by a
Stochastic Petri Net and the operational states of a homogeneous
Markov chain


IEEE TRANSACTIONS ON COMPUTERS, VOL. 41, NO. 9, SEPTEMBER 1992




TABLE I

Maximum State Space Size for Three Reconfiguration Schemes

n              2     3     4     5     6     7     8     9    10
S_SRE(n)       8    21    44    80   132   203   296   414   560
S_ARCE(n)     17    51   107   199   333   517   759  1067  1449
S_DR(n)       75   188   385   690  1127  1720  2493  3470  4675


(assuming exponentially distributed firing times) [10]. Thus, SPN's
are defined by a 5-tuple $(P, T, A, M_0, \alpha)$ where
$\alpha = \{\alpha_1, \alpha_2, \ldots, \alpha_l\}$ specifies the
firing rates of the transitions.

To model fault-tolerant processor arrays, an operational configuration
corresponds to an operational state in the Markov chain; thus, to
derive all possible operational configurations of the array, a marking
in the SPN must correspond to an operational configuration of
the array. Each place $p_i$ identifies components of type $i$ and, at
a given marking $M_q$, the number of tokens $m_{iq}$ corresponds to
$\eta_i$, which is the number of components of type $i$. Two or more
distinct component types may identify the same physical component;
for example, a physical spare is a component of the type "active
spare" when it is used to replace or bypass a faulty part and it
is a component of the type "nonactive spare" otherwise. Consider
the $n \times n$ SRE array in Fig. 1; a marking $M_q$ is described as
$M_q = (\#PE, \#IOL, \#S, \#BL, \#HL, \#SBL) = [m_{1q}, \ldots, m_{6q}]$,
where the symbol $\#$ is used to denote "number of."

A possible SPN representation is given in Fig. 2. The firing of
$t_1$ represents the occurrence of a fault in a PE, a fault in an IOL
is represented by the firing of $t_2$, and so on. In general, the
firing of $t_i$ represents a fault occurrence in a component of type
$i$ where $1 \le i \le k$ and $k$ is the number of places and
transitions. In SRE, $k = 6$ and component types 1 through 6
correspond to PE, IOL, S, BL, HL, and SBL, respectively.

Although the SPN in Fig. 2 might provide the number of operational
configurations required, it fails to consider the cases when
enough spares are available but reconfiguration cannot take place (due
for example to a peculiar distribution of faults). As a consequence,
this approach might provide overly optimistic reliability estimates.
Conceivably, a different SPN model can be used to accurately
represent the dependency of successful reconfigurations on fault
distributions. However, such an SPN would itself consist of a large
number of places and transitions (which increases with the size of
the array) that would generate a large state space in the underlying
Markov chain. One of the intents of this paper is to provide an
extension of the SPN concept so that dependence on fault distribution
can be accounted for in a model with a complexity comparable to
that of Fig. 2 regardless of the size of the array and the
reconfiguration scheme used.

B. MSPN Representation

The fact that several types of faults affect the array in the same
manner suggests a more compact SPN-like representation, which is
referred to as Modified SPN (MSPN).

Definition: An MSPN is defined as
$MSPN = (P, T, A, M_0, Pr, Sq, B, Cv)$ where
$P = \{p_1, \ldots, p_k\}$ is the set of places,
$T = \{t_1, \ldots, t_l\}$ is the set of transitions,
$A \subseteq [P \times T] \cup [T \times P]$ is the set of input and
output arcs, $M_0$ is an initial marking, $Pr$ is a set of probability
functions $P(x|M_q, t_i)$ associated to transitions $t_i \in T$,
$Sq$ is a set of sequences $S_i$ of transitions that fire immediately
after an exponential firing of transition $t_i \in T$, $B$ is the set
of binary transition vectors $B_i$ defined for each $t_i \in T$, and
$Cv$ is the set of coverage constants $c_i$ associated with each
$t_i \in T$. □

An MSPN representation associates a set of attributes with each
transition $t_i$; i.e.,
$\langle P(x|M_q, t_i), S_i(M_q), B_i, c_i \rangle$ where:

$P(x|M_q, t_i)$ defines a discrete probability function where $x$
represents a random marking $M_j$ in a set $R$ directly reachable from
a particular marking $M_q$; the notation $Pr^i_{qj}$ is used to denote
the probability of reconfiguration $P(x = M_j|M_q, t_i)$, i.e., the
probability that the net is in marking $M_j$ after $t_i$ fires when
the net is in marking $M_q$;

$S_i(M_q)$ is a sequence of transitions that will fire immediately
after $t_i$ fires. If no immediate firing is required then $S_i(M_q)$
is a null sequence. Depending on the reconfiguration scheme, $S_i$ can
be identical for all markings or can be determined in terms of $M_q$;

$B_i$ is a binary transition vector with $k$ elements $b_j$ such that
$b_j = 1$ if the failure of the $j$th component triggers the
transition $t_i$ and $b_j = 0$ otherwise. This vector facilitates the
merging of nondistinct markings, the derivation of probability
transition vectors (defined in Section IV), and the derivation of
flags (i.e., failure rate conditions) that signal a possible
nonoccurrence of a transition;

$c_i$ is a coverage factor associated with $t_i$, such that if $t_i$
is triggered by the failure of a spare (inactive) component then
$c_i = 1$; if $t_i$ is triggered by the failure of an active component
then $c_i$ corresponds to the probability of detection given that a
fault occurs (i.e., $c_i \le 1$).

For each transition $t_i$ in an MSPN, there is a set $A_i$ of I/O
places and a set $V_i$ of multiplicities $\mu$ associated with each
arc in $A_i$. When the MSPN is in a marking $M_q$ and a transition
$t_i$ fires, the number of tokens $m_{lq}$ in each place
$p_l \in A_i$ is then modified as follows:
$m_{lj} = m_{lq} - \mu^I_l + \mu^O_l$, resulting in the generation of
a new marking $M_j$. Since the input multiplicities $\mu^I_l$ and
output multiplicities $\mu^O_l$ can be functions of the size of the
array and the current marking, they are referred to as variable
multiplicities; this concept is a natural extension to the usual
notion of "multiplicities" in Petri Net theory and it has been used in
the development of the Stochastic Petri Net Package (SPNP) [11]. The
notion of "immediate" firing was introduced in the Generalized
Stochastic Petri Net model [12] with the classification of timed and
immediate transitions. The former fire after a random exponentially
distributed enabling time; the latter fire as soon as they are
enabled. A similar distinction is made in the Stochastic Activity
Network (SAN) models used in developing Metasan [13]; in a SAN model,
instantaneous activities occur in a negligible amount of time.
Immediate firings in an MSPN capture a sequence of structural changes
that may occur when the reconfiguration procedure is triggered and
spare support is not available. However, these structural changes must
be preserved as states with transition rates modified by the
distribution $P(x|M_q, t_i)$ (see Fig. 4). By specifying in $S_i$
which transitions fire immediately, there is no need for an explicit
representation of immediate transitions for each $t_i$ with additional
places to control the firing sequence until a failure marking is
reached. It is important to note that the association of
$P(x|M_q, t_i)$ allows the simplification of the model and therefore
the reduction of the state space. Otherwise, additional places and
transitions must be created to model the effect of faulty spares
distributed in the array.

As an example, the MSPN representation of SRE given in Fig. 3
describes the fault model of an $n \times n$ array. Since $t_1$,
$t_2$, and $t_5$ in Fig. 2 have the same effect on the array, a single
transition $t_1$ is defined in Fig. 3 with a vector
$B_1 = [1\ 1\ 0\ 0\ 1\ 0]$ to indicate that the failure of either a
PE, an IOL, or an HL can cause $t_1$ to fire. Likewise, $t_3$ and
$t_4$ become $t_2$ with a vector $B_2 = [0\ 0\ 1\ 1\ 0\ 0]$.
The firing of $t_2$ represents the failure of a BL or a S, either of which




Fig.  2.  SPN  of  the  SRE  scheme. 


is fatal. Transition $t_6$ becomes $t_3$ and represents the failure of
a SBL with a vector $B_3 = [0\ 0\ 0\ 0\ 0\ 1]$.

Consider now the case when a given operational configuration
contains faulty SBL's. The probability that none of them lies in the
row that is eliminated when $t_1$ fires corresponds to the probability
of survival $P_s$, such that $Pr^1_{qj} = P_s$. If at least one SBL is
faulty in the row deleted, the array fails to reconfigure with a
probability $Pr^1_{qf} = 1 - P_s$ (the failure marking is denoted as
$M_f$); i.e., $t_2$ fires immediately and $S_1 = (t_2)$. This follows
from the fact that when $t_1$ fires, the required SBL's become BL's
and if one of them is faulty $t_2$ fires to generate a failure
marking. In general, probabilities of survival are complicated
functions of the characteristics of each reconfiguration scheme (e.g.,
replacement rules, hardware requirements, dependencies on fault
distributions, etc.) and the size and shape of the array. They must be
derived for each scheme and in some cases extensive simulation is
required due to the complexity of the combinatorial analysis involved.
Examples where suitable expressions have been derived include SRE,
ARCE, and DR [5]. For SRE, $P_s$ can be estimated via (5.2) where $N$
corresponds to the number of faulty SBL's present in the current
marking $M_q$. Note that both values $P_s$ and $1 - P_s$ form a
discrete distribution which is associated with $t_1$ as indicated in
Fig. 3. When $t_2$ or $t_3$ fires exponentially, no immediate
transition firing is required and $S_2$ and $S_3$ are null sequences.
In some applications such as ARCE [5], the sequence $S_i$ is not
unique as it is selected depending on the current marking.

The set of multiplicities
$V_1 = \{\mu^I_1 = n,\ \mu^I_2 = 2n,\ \mu^I_5 = n - 1,\ \mu^I_6 = n,\ \mu^O_4 = n\}$
indicates how the I/O places in
$A_1 = \{\{p_1, p_2, p_5, p_6\}, \{p_4\}\}$ are affected when $t_1$
fires. The failure of a S or a BL will cause the transition $t_2$ to
fire. Since these failures are fatal, all places are affected and
$A_2 = \{\{p_1, \ldots, p_6\}, \emptyset\}$ with a set of
multiplicities (labeled "all" in Fig. 3)
$V_2 = \{\mu^I_1 = \#PE_q,\ \mu^I_2 = \#IOL_q,\ \mu^I_3 = \#S_q,\ \mu^I_4 = \#BL_q,\ \mu^I_5 = \#HL_q,\ \mu^I_6 = \#SBL_q\}$,
which sets all places to zero to indicate a failure marking. Finally,
the failure of a SBL ($t_3$) affects only SBL's; i.e.,
$A_3 = \{\{p_6\}, \emptyset\}$ with $V_3 = \{\mu^I_6 = 1\}$.
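
The effect of these variable multiplicities on a marking can be illustrated with a short Python sketch. The function names are hypothetical, the place order (PE, IOL, S, BL, HL, SBL) follows the marking definition of Section III-A, and the initial marking is the one given for SRE in Section V; nothing here is taken from the MGRE implementation.

```python
# Illustrative sketch of the three SRE transitions of the MSPN.
# Markings are lists ordered as (PE, IOL, S, BL, HL, SBL).

def initial_marking(n):
    """Initial SRE marking of an n x n array."""
    return [n * n, 2 * n * n, n + n * n, 0, n * (n - 1), n * n]


def fire_t1(m, n):
    """Row elimination: inputs of n PE's, 2n IOL's, n-1 HL's, and n SBL's;
    output of n BL's (the multiplicities depend on the array size n)."""
    pe, iol, s, bl, hl, sbl = m
    return [pe - n, iol - 2 * n, s, bl + n, hl - (n - 1), sbl - n]


def fire_t2(m):
    """Fatal failure of a switch or an active bypass link: the input
    multiplicities equal the current token counts, emptying every place."""
    return [0] * len(m)


def fire_t3(m):
    """Failure of one spare bypass link: only the SBL place loses a token."""
    return m[:5] + [m[5] - 1]


m = initial_marking(4)   # 4 x 4 array, all components fault-free
m = fire_t1(m, 4)        # one row eliminated
m = fire_t3(m)           # one spare bypass link fails
after = fire_t1(m, 4)    # a second row is eliminated
```

Only `fire_t2` needs the current marking to determine its multiplicities; `fire_t1` depends on the array size alone, which is why one small net can describe arrays of any size.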

IV.  Reachability  Graph 

A.  Probability  Transition  Vectors 

For each marking $M_j$ generated when $t_i$ fires, $B_i$ and the
distribution function $P(x|M_q, t_i)$ can be used to generate vectors
of the form


Fig.  3.  MSPN  of  the  SRE  scheme. 


$Pr^i_{qj} B_i = [pr_1 \cdots pr_k]$ where $pr_l = Pr^i_{qj}$ if
$b_l = 1$ and $pr_l = 0$ if $b_l = 0$. These vectors are referred to
as the Probability Transition Vectors (PTV's). In the event that two
or more nondistinct new markings are generated from the same marking,
a merging to a single new marking is carried out by a vector addition
of the corresponding probability transition vectors. The use of PTV's
is described in Fig. 4. Assuming $j = 1, 2, \ldots, |R|$, then
$\sum_j Pr^i_{qj} = 1$ where $R$ is the set of markings generated by
the firing of $t_i$.

Example 1: Analyzing a particular marking (in a $4 \times 4$ array
with an MSPN as in Fig. 3), say $M_{18} = [12\ 24\ 20\ 4\ 9\ 11]$,
then if $t_1$ fires (i.e., either a PE, an IOL, or an HL failed) the
marking $M_{30} = [8\ 16\ 20\ 8\ 6\ 7]$ results with
$Pr^1_{18,30} = .667$. Thus, a PTV is given by
$Pr^1_{18,30} B_1 = [.667\ .667\ 0\ 0\ .667\ 0]$. However, the array
may fail with $Pr^1_{18,f} = .333$ due to the existence of one faulty
SBL (#PE's $-$ #SBL's $= 12 - 11 = 1$). Thus, a PTV is given by
$Pr^1_{18,f} B_1 = [.333\ .333\ 0\ 0\ .333\ 0]$. Note that
$Pr^1_{18,30} = P_s$ and $Pr^1_{18,f} = 1 - P_s$ where $P_s$ is
estimated




using (5.2). When $t_2$ fires (i.e., a S or a BL failed), the array
fails with probability $Pr^2_{18,f} = 1$; the corresponding PTV is
given by $Pr^2_{18,f} B_2 = [0\ 0\ 1\ 1\ 0\ 0]$. Therefore, the overall
PTV associated with the transition to the failure marking when $t_1$
or $t_2$ fires is obtained as follows:
$Pr^1_{18,f} B_1 + Pr^2_{18,f} B_2 = [.333\ .333\ 1\ 1\ .333\ 0]$.

□ 
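
The construction and merging of PTV's in this example can be reproduced with a few lines of Python. This is only a sketch: the helper names `ptv` and `merge` are hypothetical, and the survival probability is entered as the exact ratio 8/12 that (5.2) gives for one faulty SBL among 12 with 4 SBL's per row.

```python
# Illustrative sketch of probability transition vectors (PTV's).

def ptv(prob, b):
    """Scale the binary transition vector B_i by a reconfiguration probability."""
    return [prob * bit for bit in b]


def merge(*vectors):
    """Merge nondistinct target markings by element-wise vector addition."""
    return [sum(column) for column in zip(*vectors)]


B1 = [1, 1, 0, 0, 1, 0]   # t1: PE, IOL, or HL failure
B2 = [0, 0, 1, 1, 0, 0]   # t2: S or BL failure (fatal)

p_s = 8 / 12              # probability of survival for this marking

to_operational = ptv(p_s, B1)                      # about [.667 .667 0 0 .667 0]
to_failure = merge(ptv(1 - p_s, B1), ptv(1.0, B2)) # about [.333 .333 1 1 .333 0]
```

Because both `t1` (with probability `1 - p_s`) and `t2` lead to the same failure marking, their vectors are added rather than kept as two separate arcs.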

B. Derivation of State Transition Rates

The state transition rates $a_{ql}$ between operational states of a
continuous-time Markov chain (CTMC) can be expressed in terms of
the attributes associated with the transitions in the MSPN, i.e.,
$a_{ql} = f(P(x|M_q, t_i), B_i, c_i)$. Details on CTMC models and
numerical solutions are given in [14].

Let $M^{\alpha}_q = [\alpha_j m_{jq};\ j = 1, \ldots, k]$ where
$\alpha_j$ is the failure rate of the $j$th component and $m_{jq}$ is
the number of such components in $M_q$, then

$$a_{ql} = \Big[\sum_i c_i\, Pr^i_{ql}\, b_j;\ b_j \in B_i,\ j = 1, \ldots, k\Big] (M^{\alpha}_q)^T \qquad (4.1)$$

defines the transition rate from state $q$ to an operational state
$l$. The summation is defined over all transitions that fire
exponentially and generate the same marking $l$. It is interesting to
note the relationship of (4.1) with the firing rate of a particular
transition $t_i$ given by the vector product $B_i (M^{\alpha}_q)^T$.
Note also that a transition $t_i$ fires only if $B_i \alpha^T > 0$. A
transition to the failure state occurs for lack of support (i.e.,
enough spares) or lack of coverage. In the first case, lack of support
occurs if the reconfiguration algorithm fails due to exhaustion of
spare components or because a given distribution of faults is not
supported by the reconfiguration algorithm. Denote by
$\lambda^s_{qf}$ the transition rate to the failure state $f$ for lack
of support, then

$$\lambda^s_{qf} = \Big[\sum_i Pr^i_{qf}\, b_j;\ b_j \in B_i,\ j = 1, \ldots, k\Big] (M^{\alpha}_q)^T.$$

Let $\lambda^c_{qf}$ be the transition rate to $f$ for lack of
coverage, then

$$\lambda^c_{qf} = \sum_l \Big[\sum_i (1 - c_i)\, Pr^i_{ql}\, b_j;\ b_j \in B_i,\ j = 1, \ldots, k\Big] (M^{\alpha}_q)^T.$$

The overall transition rate to the failure state is
$\lambda_{qf} = \lambda^s_{qf} + \lambda^c_{qf}$, and the diagonal
term of the matrix $A$ is calculated as follows:

$$a_{qq} = -\Big(\sum_l a_{ql} + \lambda_{qf}\Big).$$

Example 2: Consider a particular marking, say
$M_{18} = [12\ 24\ 20\ 4\ 9\ 11]$; then the failure rates
$\alpha_1 = 1$, $\alpha_2 = \cdots = \alpha_6 = 0.01$ yield
$M^{\alpha}_{18} = [12\ 0.24\ 0.20\ 0.04\ 0.09\ 0.11]$. As in
Example 1, the following probabilities are used:
$Pr^1_{18,30} = 0.667$, $Pr^1_{18,f} = 0.333$, $Pr^2_{18,f} = 1$. Let
$c_1 = c_2 = c$; then, when $t_1$ fires, the following transition rate
in the Markov chain is generated:
$a_{18,30} = c[0.667\ 0.667\ 0\ 0\ 0.667\ 0](M^{\alpha}_{18})^T = 8.2241c$.
The transition rate to the failure state due to lack of support when
$t_1$ and $t_2$ fire is
$\lambda^s_{18,f} = ([0.333\ 0.333\ 0\ 0\ 0.333\ 0] + [0\ 0\ 1\ 1\ 0\ 0])(M^{\alpha}_{18})^T = 4.34589$.
For lack of coverage,
$\lambda^c_{18,f} = (1 - c)[0.667\ 0.667\ 0\ 0\ 0.667\ 0](M^{\alpha}_{18})^T = 8.2241(1 - c)$.
The overall transition rate to the failure state is given as
$\lambda_{18,f} = 8.2241(1 - c) + 4.34589$. Considering that the
failure of a SBL ($t_3$) yields a transition rate
$a_{18,19} = [0\ 0\ 0\ 0\ 0\ 1](M^{\alpha}_{18})^T = 0.11$,
the diagonal term is calculated as
$a_{18,18} = -(8.2241c + 4.34589 + 8.2241(1 - c) + 0.11) = -12.68$. □
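
The arithmetic of Example 2 can be reproduced with a short Python sketch. The helper names are hypothetical. The HL count is taken as 9, which is the value consistent with the 0.09 entry of the rate-weighted marking, and the rounded probabilities 0.667 and 0.333 are used as in the example.

```python
# Illustrative recomputation of the transition rates of Example 2.

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def ptv(prob, b):
    """Probability transition vector: B_i scaled by a probability."""
    return [prob * bit for bit in b]


B1 = [1, 1, 0, 0, 1, 0]
B2 = [0, 0, 1, 1, 0, 0]
B3 = [0, 0, 0, 0, 0, 1]

marking = [12, 24, 20, 4, 9, 11]                  # (PE, IOL, S, BL, HL, SBL)
alpha = [1.0, 0.01, 0.01, 0.01, 0.01, 0.01]       # failure rates
m_alpha = [a * m for a, m in zip(alpha, marking)] # rate-weighted marking

coef_operational = dot(ptv(0.667, B1), m_alpha)   # about 8.2241, multiplies c
lam_support = dot([x + y for x, y in zip(ptv(0.333, B1), ptv(1.0, B2))],
                  m_alpha)                        # about 4.34589
rate_sbl = dot(B3, m_alpha)                       # about 0.11


def diagonal(c):
    """Diagonal term: minus the sum of all rates leaving the state."""
    lam_coverage = (1 - c) * coef_operational
    return -(c * coef_operational + lam_support + lam_coverage + rate_sbl)
```

The diagonal term does not depend on `c`: coverage only decides how the total rate 8.2241 produced by `t1` is split between the operational target and the failure state.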

C.  Implementation  Procedure 

The implementation procedure MODELGEN outlined below
merges repeated markings as they are being generated and calculates
or modifies the transition rates in the process. The new markings
generated in every iteration are targets of the currently visited
marking; they are inserted into a linked list of markings which is
sorted with respect to the sum of the components indicated by
previously set comparison flags. An array of pointers to the newly
created targets is updated. A pointer to the current marking is
denoted by cm. A pointer to the next marking in the linked list is
referred to by next. A systematic indexing of markings is carried out
such that the resulting matrix is always upper triangular: a marking
number is assigned to every next marking fetched. A pointer nxtopr
points to markings which are candidates to be printed or saved in a
file. A marking becomes a candidate for output if all its targets have
been numbered; therefore, no significant amount of memory is required
to generate large models. The procedure stops when all markings have
been fetched from the sorted list. Notice that the linked list
corresponds to the reachability graph and, as it is formed, both a
matrix representation of the Markov model and the reachability graph
are printed out or saved as requested by the user.

Procedure MODELGEN

Inputs:  Set of failure rates ($\alpha_i$);
         File names (to save the matrix representation of the Markov
         model);
         An MSPN representation of the reconfiguration algorithm.

Outputs: Reachability graph description;
         Matrix representation of the Markov model.

Begin
    Set a coverage flag for each transition that is fired by
    nonactive components
    (this allows for the evaluation of a symbolic matrix for
    different coverage values).
    Set comparison flags (to select those components by
    which the list of markings is sorted).
    Load initial marking
    for each $t_i$ let $s_i = B_i \alpha^T$
    Let cm point to initial marking
    nxtopr = cm
    while not end of list do
        fetch current marking
        assign a number to current marking
        for each $t_i$ and if $s_i > 0$ do
            get $P(x|M_q, t_i)$
            fire $t_i$ and those transitions $t_j \in S_i$
            calculate transition rates
            store targets in temporary table
        end for
        merge repeated targets
        insert new targets in sorted list
        insert pointers to new targets in current marking
        while all targets of nxtopr are numbered do
            output nxtopr marking
            let nxtopr = nxtopr → next
        end while
        let cm = cm → next
    end while
end procedure
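
The core loop of the procedure can be sketched in Python for the SRE scheme. This is a simplified illustration under stated assumptions, not the MGRE implementation: a dictionary and a queue replace the sorted linked list, markings are not numbered or written out incrementally, all names are hypothetical, and the SRE-specific rules (multiplicities, survival probability (5.2) with $c = n$ columns, coverage applied only to `t1`) are taken from Sections III to V.

```python
from collections import deque
from math import comb

FAIL = "F"                      # single absorbing failure state
B1 = (1, 1, 0, 0, 1, 0)         # t1: PE, IOL, or HL failure
B2 = (0, 0, 1, 1, 0, 0)         # t2: S or BL failure (fatal)
B3 = (0, 0, 0, 0, 0, 1)         # t3: SBL failure


def firing_rate(b, marking, alpha):
    """Firing rate B_i . M^alpha of a transition in the given marking."""
    return sum(bit * a * m for bit, a, m in zip(b, alpha, marking))


def targets(marking, n, alpha, c):
    """Yield (target, rate) pairs for one marking (PE, IOL, S, BL, HL, SBL)."""
    pe, iol, s, bl, hl, sbl = marking
    faulty = pe - sbl                                  # faulty SBL's, N
    r1 = firing_rate(B1, marking, alpha)
    if r1 > 0:
        p_s = comb(pe - n, faulty) / comb(pe, faulty)  # survival, as in (5.2)
        if pe == n:                                    # last row: nothing left
            p_s = 0.0
        if p_s > 0:
            nxt = (pe - n, iol - 2 * n, s, bl + n, hl - (n - 1), sbl - n)
            yield nxt, c * p_s * r1
        # lack of support plus lack of coverage
        yield FAIL, (1 - p_s) * r1 + (1 - c) * p_s * r1
    r2 = firing_rate(B2, marking, alpha)
    if r2 > 0:
        yield FAIL, r2
    r3 = firing_rate(B3, marking, alpha)
    if r3 > 0:
        yield (pe, iol, s, bl, hl, sbl - 1), r3


def modelgen(n, alpha, c):
    """Return {operational marking: {target: rate}} for an n x n SRE array."""
    m0 = (n * n, 2 * n * n, n + n * n, 0, n * (n - 1), n * n)
    rates, queue = {}, deque([m0])
    while queue:
        m = queue.popleft()
        if m in rates:
            continue
        rates[m] = {}
        for tgt, rate in targets(m, n, alpha, c):
            rates[m][tgt] = rates[m].get(tgt, 0.0) + rate  # merge repeats
            if tgt != FAIL:
                queue.append(tgt)
    return rates


model = modelgen(4, (1.0, 0.01, 0.01, 0.01, 0.01, 0.01), 0.99)
print(len(model))   # number of operational markings generated
```

With all six failure rates nonzero, the sketch generates 44 operational markings for a 4 x 4 array, in agreement with Table I. Setting a failure rate to zero disables the corresponding transition, which is the role of the test $s_i > 0$ in the procedure.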

This  procedure  has  been  incorporated  into  a  software  package 
MGRE  (Model  Generator  and  Reliability  Evaluator)  described  in 
[5].  MGRE  generates  and  solves  reliability  models  given  the  re¬ 
configuration  scheme,  the  size  of  the  array  and  a  set  of  failure 
rates. 

The execution time of MODELGEN is proportional to the number
of states generated and therefore depends on the reconfiguration
algorithm. For the applications discussed in [5], the execution time
is $O(n^3)$ for $n \times n$ processor arrays.




TABLE II

Failure Rates Used for the Results Shown in Table III

Array  col.  α1   α2    α3    α4    α5    α6    α7
SRE    a     sx   -     -     -     -     -     -
       b     1    0.0   0.0   0.0   0.0   0.0   -
       c     1    0.01  0.0   0.0   0.01  0.0   -
       d     1    0.01  0.01  0.01  0.01  0.01  -
       e     1    0.01  0.1   0.01  0.01  0.01  -
       f     1    0.01  0.01  0.1   0.01  0.1   -
       g     1    0.01  0.01  0.0   0.01  0.0   -
ARCE   a     sx   -     -     -     -     -     -
       b     1    0.0   0.0   0.0   0.0   0.0   0.0
       c     1    0.01  0.0   0.0   0.01  0.01  0.0
       d     1    0.01  0.01  0.01  0.01  0.01  0.01
       e     1    0.01  0.1   0.1   0.01  0.01  0.01
       f     1    0.01  0.01  0.01  0.1   0.1   0.1

Explanation (SRE): α1 = PE f. rate; α2 = IOL f. rate; α3 = switch
f. rate; α4 = b. link f. rate; α5 = h. link f. rate; α6 = sp. b. link
f. rate.
Explanation (ARCE): α3 = c. switch f. rate; α4 = r. switch f. rate;
α5 = c. b. link f. rate; α6 = r. b. link f. rate; α7 = sp. b. l.
f. rate.
sx = simplex.

V.  Probability  Transition  Vectors  in  SRE 

For an $n \times n$ processor array using the SRE reconfiguration
scheme, the initial marking $M_0$ of the reachability graph is given
as follows:
$p_1$: $\#PE = n^2$; $p_2$: $\#IOL = 2n^2$; $p_3$: $\#S = n + n^2$;
$p_4$: $\#BL = 0$; $p_5$: $\#HL = n(n - 1)$; $p_6$: $\#SBL = n^2$.

A failure marking corresponds to the case when $\#PE = 0$ or
$\#SBL < 0$. Transition $t_1$ will take place if either $\alpha_1$,
$\alpha_2$, or $\alpha_5$ is greater than zero; $t_2$ will take place
if either $\alpha_3$ or $\alpha_4$ is greater than zero. Likewise,
$t_3$ takes place if $\alpha_6 > 0$. Other applications include
the ARCE [3] and the Direct Reconfiguration (DR) [4] schemes, reported
in [5], and FUSS, reported in [15].

To derive PTV's, estimations of probabilities of reconfiguration
are required for each reconfiguration scheme. In the SRE case, its
MSPN indicates (Fig. 3) that when $t_1$ fires, two transitions in the
Markov model may occur (corresponding to a sequence of two firings
in the MSPN: $t_1$ followed by $t_2$); one with probability
$Pr^1_{qj}$, which corresponds to the probability of survival denoted
by $P_s$. The other transition fires immediately and will lead the
processor array to a failure state with probability $Pr^1_{qf}$. The
array will survive with probability $P_s$ if the failure of a PE or an
IOL that triggers the reconfiguration algorithm occurs in a row that
contains no faulty SBL. Let $N$ denote the number of faulty SBL's in a
marking $M_q$, then $N = \#PE_q - \#SBL_q = m_{1q} - m_{6q}$ where
$0 \le N \le n \times n$.

To estimate $P_s$, it is necessary to find all the possible ways in
which $N$ faulty SBL's can be mapped into a total of $r \times c$
SBL's in the array with a current configuration containing $r$ rows
and $c$ columns. Because each row contains $c$ SBL's, up to $c$ faulty
SBL's in each row are possible. Let $X$ denote the random number of
faulty SBL's in a given row of any operational marking, then $P_s$ can
be obtained using the following hypergeometric distribution:

$$\Pr(X = x) = \binom{c}{x}\binom{rc - c}{N - x} \Big/ \binom{rc}{N}. \qquad (5.1)$$

This expression calculates the probability that $x$ faulty SBL's exist
in the row to be eliminated during reconfiguration. However, for the
case of the SRE reconfiguration scheme, a successful reconfiguration
will occur only if $x = 0$, in which case (5.1) is simplified to the
following expression:

$$P_s = \Pr(X = 0) = \binom{rc - c}{N} \Big/ \binom{rc}{N}. \qquad (5.2)$$

In [5] probabilities of survival for the ARCE and the DR schemes
are also derived using combinatorial expressions.
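
As a worked check of (5.1) and (5.2), the following Python sketch evaluates the hypergeometric terms with the standard library function `math.comb`. The function names are hypothetical; the sample values $r = 3$, $c = 4$, $N = 1$ correspond to the marking of Example 1 (three remaining rows of a 4 x 4 array and one faulty SBL).

```python
from math import comb


def pr_faulty_in_row(x, r, c, n_faulty):
    """(5.1): probability that exactly x of the n_faulty faulty SBL's lie in
    the row being eliminated, for r rows of c SBL's each."""
    return comb(c, x) * comb(r * c - c, n_faulty - x) / comb(r * c, n_faulty)


def p_survival(r, c, n_faulty):
    """(5.2): probability that the eliminated row holds no faulty SBL."""
    return comb(r * c - c, n_faulty) / comb(r * c, n_faulty)


p_s = p_survival(3, 4, 1)          # 8/12, i.e. about 0.667
total = sum(pr_faulty_in_row(x, 3, 4, 1) for x in range(0, 2))
```

`math.comb(n, k)` returns zero when `k > n`, so `p_survival` evaluates to zero whenever the faulty SBL's cannot all fit in the rows that remain after elimination.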


VI.  Reliability  Analysis 

To illustrate the applicability of MODELGEN, several sets of
failure rates ($\alpha$'s) have been selected and described in
Table II. For each failure rate set, the Reliability (R), the MTTF,
and the Reliability Improvement Factors (RIF) have been computed for a
$4 \times 4$ processor array. The results obtained are tabulated in
Table III and correspond to the analysis of the SRE scheme and the
ARCE (Alternate Row Column Elimination) scheme analyzed in [5]. In our
analysis we have used the PE failure rate as a reference normalized
with respect to the time unit such that $\alpha_1 t = 1$ and with a
coverage factor of $c = 0.99$. The computations were carried out using
the MGRE (Model Generator and Reliability Evaluator) software package
described in [5].

The main purpose of the tabulations in Table III is to show the
interdependencies of the different components now contained in the
model. Column b corresponds to the reliability and MTTF results
obtained when only PE failures are considered. Columns c to f are
the results of detailed modeling with different sets of failure rates.
Notice, for example, that in column e the sensitivity to switch
failures is reflected in a reduced MTTF. The array is less sensitive
to increments in the failure rate of BL's because at all times the
number of active switches is greater than the number of active BL's.
The simplex (sx) case in column a corresponds to the case of the
failure of the array when a single processor fails; the Reliability
Improvement Factors (RIF's) were calculated with respect to the
simplex case.

Let us consider a simplified model such as the one proposed in [16],
where the reliability of the array is expressed as
$R(t) = R_n(t) \times R_r(t)$. The terms $R_n$ and $R_r$ refer to the
reliability of nonredundant and redundant hardware, respectively. For
the SRE case, let the number of PE's in the array be the redundant
hardware, then

$$R_r(t) = \sum_{i=0}^{n-1} c^i \binom{n}{i} \left(e^{-n\alpha_1 t}\right)^{n-i} \left(1 - e^{-n\alpha_1 t}\right)^{i}.$$

Considering IOL's, HL's, and Switches as the nonredundant hardware,
then

$$R_n(t) = e^{-(2n^2\alpha_2 + (n + n^2)\alpha_3 + n(n-1)\alpha_5)t}.$$

Table  IV  shows  the  reliability  and  MTTF  results  obtained  with  a 
set  of  failure  rates  as  shown  in  row  g  in  Table  II  and  with  c  =  0.99. 
The  results  given  by  the  simplified  model  show  an  underestimation 
of  the  reliability  of  a  4  x  4  array  with  the  SRE  scheme  as  compared 
to  the  results  obtained  by  solving  the  Markov  model  generated  with 
detailed  modeling. 
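
The simplified model is easy to evaluate directly. The Python sketch below computes $R_r(t)$ and $R_n(t)$ for a 4 x 4 SRE array with the failure rates of row g and $c = 0.99$. The function names are hypothetical, and the two expressions are used as read here from the scanned equations (a sum over at most $n - 1$ eliminated rows, and an exponential over $2n^2$ IOL's, $n + n^2$ switches, and $n(n-1)$ HL's); under that reading the products agree with the "Simplified" column of Table IV to the printed precision.

```python
from math import comb, exp


def r_redundant(t, n, alpha_pe, c):
    """Redundant part: at most n-1 of the n rows may be eliminated, each
    elimination being covered with probability c."""
    p_row = exp(-n * alpha_pe * t)          # reliability of one row of n PE's
    return sum(c**i * comb(n, i) * p_row**(n - i) * (1 - p_row)**i
               for i in range(n))


def r_nonredundant(t, n, alpha_iol, alpha_s, alpha_hl):
    """Nonredundant part: every IOL, switch, and HL must survive."""
    total_rate = (2 * n * n * alpha_iol
                  + (n + n * n) * alpha_s
                  + n * (n - 1) * alpha_hl)
    return exp(-total_rate * t)


def r_simplified(t, n=4, c=0.99):
    """R(t) = R_n(t) * R_r(t) with the failure rates of row g in Table II."""
    return r_nonredundant(t, n, 0.01, 0.01, 0.01) * r_redundant(t, n, 1.0, c)


for t in (0.1, 0.3, 0.5):
    print(t, round(r_simplified(t), 6))
```

The redundant factor alone, `r_redundant(t, 4, 1.0, 0.99)`, gives the values listed for SRE in column b of Table III, where only PE failures are considered.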




TABLE III

Reliability and RIF's for SRE and ARCE with c = 0.09 and Failure Rates Given in Table II

Array  R/RIF  Time  a         b         c         d         e          f
SRE    R      0.1   0.201897  0.97553   0.974196  0.949305  0.792264   0.901154
              0.3   0.008230  0.74327   0.729796  0.663918  0.3869727  0.489128
              0.5   0.000335  0.428912  0.410413  0.347755  0.141386   0.194080
       RIF    0.1   -         32.62     3.93      15.91     3.86       8.50
              0.3   -         3.86      3.67      2.98      1.62       2.04
              0.5   -         1.75      1.70      1.54      1.17       1.27
       MTTF   -     *         0.510087  0.496435  0.447789             0.33785
ARCE   R      0.1   0.201897  0.986691  0.986227  0.985946  0.983185   0.985713
              0.3   0.008230  0.969483  0.968212  0.966729  0.939159   0.959977
              0.5   0.000335  0.95268   0.947997  0.940533  0.331963   0.903652
       RIF    0.1   -         59.97     57.95     55.75     46.78      47.29
              0.3   -                             29.31     -6.17      21.72
              0.5   -         21.13     19.22     16.61     5.93       9.65
       MTTF   -     *         2.07274   1.90644   1.74707   1.02882    1.23279
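
As a cross-check on Table III, the short sketch below recomputes a few RIF entries. It assumes that the reliability improvement factor is defined as RIF = (1 - R_simplex) / (1 - R_scheme) and that column a holds the simplex reliabilities; this section does not restate the definition, so both points are assumptions. The function name `rif` is illustrative only.

```python
def rif(r_simplex: float, r_scheme: float) -> float:
    """Assumed definition: simplex unreliability divided by scheme unreliability."""
    return (1.0 - r_simplex) / (1.0 - r_scheme)

# (simplex R from column a, SRE R from column b) at times 0.1, 0.3 and 0.5,
# copied from Table III.
sre_column_b = {0.1: (0.201897, 0.97553),
                0.3: (0.008230, 0.74327),
                0.5: (0.000335, 0.428912)}

for t, (r0, r) in sre_column_b.items():
    print(f"t = {t}: RIF = {rif(r0, r):.2f}")
```

Under this assumed definition the three values agree with the SRE entries listed for column b (32.62, 3.86 and 1.75).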

TABLE IV

Reliability Results of Simplified and Detailed Models

Time    Simplified    Detailed
0.1     0.915052      0.954906
0.3     0.613426      0.687296
0.5     0.311454      0.371557
MTTF    0.41571       0.464996

VII.  Conclusions 

In this paper, Modified Stochastic Petri Nets (MSPN's) were proposed as an
extension of SPN's to represent fault-tolerant processor arrays and
generate detailed Markov models that include several components
of the array. A mapping from transitions and markings in a MSPN
representation to transitions and states in the Markov model was
derived. This mapping allows the construction of the corresponding
Markov model as the reachability graph is being generated. The
proposed modeling approach has been applied in the analysis of
several reconfiguration schemes [5], [15]. The use of this approach
was illustrated via the detailed analysis of the SRE reconfiguration
scheme. Comparative results show the effect of detailed modeling and
component failure interdependencies in the SRE scheme as well as
in the ARCE (Alternate Row Column Elimination) scheme analyzed
in detail in [5].

Detailed reliability modeling involves three interrelated problems.
First, a thorough analysis of the array is required so that all
interdependencies among failures of different components are
defined; this can be a difficult task for complex reconfiguration
schemes. Second, the more detail is included in the model, the
larger the state space of the resulting Markov chain. These
two problems are inherent to reconfigurable arrays. The use of
MSPN's as input to MGRE greatly reduces the task of capturing
the fault behavior of the array; once the MSPN model is derived,
it remains invariant, and Markov models can be generated for any
array size and for any set of failure rates. The last problem involves
the solution of the generated models. Fortunately, large
nonstiff models can be solved using the randomization technique
[17], which is implemented in MGRE. We emphasize that the models
generated are upper triangular sparse matrices, which facilitates the
solution process. Also, approximation and reduction methods can be
applied to large models to derive lower and upper reliability bounds
[18]. These techniques are being implemented in MGRE. Finally,
considering that reconfiguration algorithms are primarily
designed to treat failures of PE's only, a MSPN representation for
a given reconfiguration scheme will depend on the assumptions
made as to how the algorithm treats failures of component types
other than PE's; in this regard, the application chosen (SRE)
illustrates the modeling of failure interdependencies as well as the
aggregation of probabilities of survival and coverage factors. The
combination of marking-dependent variable multiplicities and the
set of attributes associated with transitions makes
MSPN's a flexible and compact modeling tool for capturing the fault
behavior of fault-tolerant processor arrays. Such flexibility is not
found in other packages such as METASAN [13], GreatSPN [19], and
SHARPE [7]. Some of the features of MSPN's are incorporated
in SPNP [11], which is being developed independently of
MGRE.
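
The randomization technique [17] (also known as uniformization) obtains the transient state probabilities of a continuous-time Markov chain as a Poisson-weighted sum of powers of a discrete-time transition matrix. The sketch below is a minimal pure-Python illustration of that idea and is not the MGRE implementation; the function name, the truncation tolerance, the iteration cap, and the three-state example are assumptions made for illustration.

```python
import math

def transient_probs(Q, p0, t, tol=1e-12, max_terms=10_000):
    """Transient distribution p(t) of a CTMC with generator Q via randomization.

    Q[i][j] is the transition rate from state i to state j (rows sum to zero)
    and p0 is the initial distribution. Uses
        p(t) = sum_k Poisson(k; L*t) * p0 * P**k,   P = I + Q / L,
    truncated once the accumulated Poisson mass is within tol of 1.
    """
    n = len(Q)
    L = max(-Q[i][i] for i in range(n)) or 1.0      # randomization rate
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / L for j in range(n)]
         for i in range(n)]
    weight = math.exp(-L * t)                       # Poisson term for k = 0
    vec = list(p0)                                  # holds p0 * P**k
    result = [weight * x for x in vec]
    acc, k = weight, 0
    while 1.0 - acc > tol and k < max_terms:
        k += 1
        vec = [sum(vec[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= L * t / k
        acc += weight
        result = [r + weight * x for r, x in zip(result, vec)]
    return result

# Hypothetical 3-state model: two units good -> one unit good -> failed,
# unit failure rate 1. The generator is upper triangular, as noted in the text.
Q = [[-2.0, 2.0, 0.0],
     [0.0, -1.0, 1.0],
     [0.0, 0.0, 0.0]]
p = transient_probs(Q, [1.0, 0.0, 0.0], t=0.5)
print("R(0.5) =", p[0] + p[1])   # analytic value: 2*exp(-0.5) - exp(-1.0)
```

For stiff models (large L*t) the leading Poisson term underflows, which is one reason the text restricts the technique to nonstiff models.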

References 

[1] C. S. Raghavendra, A. Avizienis, and M. Ercegovac, "Fault-tolerance
in binary tree architectures," IEEE Trans. Comput., vol. C-33, pp.
568-572, June 1984.

[2] I. Koren and D. K. Pradhan, "Modeling the effect of redundancy on
yield and performance of VLSI systems," IEEE Trans. Comput., vol.
C-36, pp. 344-355, Mar. 1987.

[3] J. A. B. Fortes and C. S. Raghavendra, "Gracefully degradable processor
arrays," IEEE Trans. Comput., vol. C-34, pp. 1033-1044, Nov. 1985.

[4] M. G. Sami and R. Stefanelli, "Reconfigurable architectures for VLSI
processing arrays," Proc. IEEE, vol. 74, pp. 712-722, May 1986.

[5] N. Lopez-Benitez, "Detailed modeling and reliability estimation of
fault-tolerant processor arrays," Ph.D. dissertation, Dep. Elec. Eng., Purdue
Univ., 1989.

[6] T. E. Mangir and A. Avizienis, "Fault-tolerant design for VLSI: Effect
of interconnect requirements on yield improvement of VLSI designs,"
IEEE Trans. Comput., vol. C-31, pp. 609-616, July 1982.

[7] R. A. Sahner and K. S. Trivedi, "A hierarchical, combinatorial-Markov
method of solving complex reliability models," in ACM/IEEE Proc. Fall
Joint Comput. Conf., Nov. 1986, pp. 817-825.

[8] W. G. Bouricius, W. C. Carter, and P. R. Schneider, "Reliability
modeling techniques for self-repairing computer systems," in Proc. 24th
Nat. Conf. ACM, Aug. 1969.

[9] W. Reisig, Petri Nets: An Introduction. Berlin, Germany:
Springer-Verlag, 1985.

[10] M. K. Molloy, "Performance analysis using Stochastic Petri Nets," IEEE
Trans. Comput., vol. C-31, no. 9, pp. 913-917, Sept. 1982.

[11] G. Ciardo, J. Muppala, and K. S. Trivedi, "SPNP: Stochastic Petri Net
package," in Proc. Petri Nets and Perform. Models, Dec. 1989, pp.
142-151.

[12] M. A. Marsan, G. Conte, and G. Balbo, "A class of generalized
Stochastic Petri Nets for the performance evaluation of multiprocessor
systems," ACM Trans. Comput. Syst., vol. 2, no. 2, pp. 93-122, May
1984.

[13] W. H. Sanders and J. F. Meyer, "METASAN: A performability evaluation
tool based on Stochastic Activity Networks," in ACM/IEEE Proc.
Fall Joint Comput. Conf., Nov. 1986.

[14] A. Reibman and K. S. Trivedi, "Numerical transient analysis of Markov
models," Comput. and Oper. Res., vol. 15, no. 1, pp. 19-36, 1988.

[15] M. Chean and J. A. B. Fortes, "The Full-Use-of-Suitable-Spares (FUSS)
approach to hardware reconfiguration for fault-tolerant processor
arrays," IEEE Trans. Comput., vol. 39, no. 4, pp. 564-571, Apr. 1990.

[16] Y. X. Wang and J. A. B. Fortes, "On the analysis and design of
hierarchical fault-tolerant processor arrays," in Proc. Int. Workshop
Defect and Fault Tolerance in VLSI Syst., Oct. 1988.

[17] D. Gross and D. R. Miller, "The randomization technique as a modeling
tool and solution procedure for transient Markov processes," Oper. Res.,
vol. 32, no. 2, pp. 343-361, Mar.-Apr. 1984.

[18] M. Smotherman, R. M. Geist, and K. S. Trivedi, "Provably conservative
approximations to complex reliability models," IEEE Trans. Comput.,
vol. C-35, no. 4, pp. 333-338, Apr. 1986.

[19] G. Chiola, "A software package for the analysis of Generalized
Stochastic Petri Net models," in Proc. Int. Workshop Timed Petri Nets, July
1985.


REFERENCE  NO.  10 


Wang,  Y-X.  and  Fortes,  J.A.B.,  "On  the  Analysis  and  Design  of  Hierarchical  Fault- 
Tolerant  Processor  Arrays,"  International  Workshop  on  Defect  and  Fault-Tolerance 
in  VLSI  Systems,  October  6-7, 1988,  Springfield,  Massachusetts. 

Note  -  This  paper  addresses  the  problem  of  designing  hierarchically  organized  fault- 
tolerant  processor  arrays  so  that  their  reliability  is  optimized. 


ON  THE  ANALYSIS  AND  DESIGN  OF  HIERARCHICAL 


FAULT-TOLERANT  PROCESSOR  ARRAYS 

Y-X.  Wang  and  Jose  A.  B.  Fortes* 

School  of  Electrical  Engineering 
Purdue  University 
West  Lafayette,  IN  47907 

1.  INTRODUCTION 

The relevance of Processor Arrays (PA's) stems not only from their ability to meet the
computational demands of many real-time applications but also from their suitability for
efficient implementations using Very Large Scale Integration/Wafer Scale Integration
(VLSI/WSI) technology 1. Fault-tolerant PA's (FTPA's) contain hardware redundancy for
some of their components (globally referred to as "redundant hardware") as well as hardware
for which spares are not provided (called "non-redundant hardware"). Typically, redundant
hardware includes processor elements (PE's) and, sometimes, communication links, whereas
non-redundant hardware may include circuitry for control, fault detection, hardware
reconfiguration, etc. Low reliability in a poorly designed FTPA may result from, among other
reasons, the unreliability of non-redundant hardware. This is particularly true for very large
arrays and/or long operation times. This problem can be avoided in well designed hierarchical
FTPA's, i.e., processor arrays organised as small fault-tolerant arrays of small FTPA's.
This paper addresses the problems of analytically estimating and optimizing the reliability of
hierarchical FTPA's.

Extensive  work  has  been  done  towards  devising  FTPA’s  and  a  non-exhaustive  list  of 
references  is  *-u.  There  exist  many  fault-tolerant  schemes  such  as  the  Column  Redundancy 
(CR) (3, 8 and others), Diogenes (DI) 8, Complex Fault Stealing (CFS) 7 and Triplicated
Modular Redundancy (TMR) 10. Hierarchical FTPA's or closely related approaches have
been  proposed  ( 13_18,  ®>_J3  and  others)  for  the  yield  improvement  purposes.  Because  closed 
form  expressions  for  the  reliability  of  FTPA’s  are  not  easy  to  derive,  one  of  the  motivations 
of  this  paper  is  to  provide  the  general  reliability  estimation  model  based  on  the  properties  of 
the  hazard  function  that  describes  the  instantaneous  failure  rate  of  FTPA’s. 

The basic ideas and the approximation model of the hazard function are studied in Section 2.
Section 3 introduces two general approximation models for the reliability of FTPA's:
the "single-Weibull" and "double-Weibull" approximation models. The first is used to
estimate the reliability of FTPA's where the reliabilities of a unit of non-redundant area (where
the non-redundant hardware is laid out) and of a unit of redundant area (where the redundant
hardware is laid out) have the same value. The second can be used to estimate the reliability of
FTPA's where the two reliabilities are independent variables. In Section 4, the motivations for
hierarchically structured FTPA's are discussed and the methodology for the design of bi-level
hierarchical FTPA's is presented. A case study is briefly described that illustrates the very
high reliability improvement achievable by a bi-level structure in contrast with the poor
reliability of single-level FTPA's with the same amount of redundancy. Section 5 is dedicated
to conclusions.


* This research was supported in part by the National Science Foundation under Grant
DCI-8419745 and in part by the Innovative Science and Technology Office of the Strategic Defense
Initiative Organization and was administered through the Office of Naval Research under contract
No. 00014-85-k-0588.


2.  THE  OVERALL  RELIABILITY  AND  HAZARD  FUNCTION  OF  FTPA’S 

The overall reliability of FTPA's depends on many factors, e.g., the reconfiguration
scheme used, the number of redundant components added, the size of the array, the complexity
of the PE's, the reliability of switches and control circuits, etc. One reconfiguration scheme may
perform better than the others under some circumstances and worse otherwise.
For example, as shown in Figure 1, the DI scheme is better than the CFS scheme if the
size of the FTPA is small, whereas for large FTPA's the opposite is true (see 13 for details
and assumed design and technology parameters).

Without loss of generality, let's assume that an FTPA contains $(n \times n) + k$ PE's, where
$k$ is the number of spare PE's. Let $A_n(n,k)$ and $A_r(n,k)$ represent the areas of the
non-redundant and redundant hardware in the FTPA, and let $R_r(n,k,r_r)$ and $R_n(n,k,r_n)$ denote the
reliabilities of the redundant and non-redundant hardware, respectively, where $r_r$ is the
reliability of a unit of redundant area (in this paper, the area of a single PE is defined as the
unit of area) and $r_n$ is the reliability of a unit of non-redundant area. The reliability of an
FTPA can be written as

$$R(n,k,r_n,r_r) = R_n(n,k,r_n) \cdot R_r(n,k,r_r). \qquad (2.1)$$

Clearly, the overall reliability of an FTPA is at most as good as the lower of the reliabilities
of the redundant hardware and the non-redundant hardware.

In an FTPA, if some PE's fail, the reconfiguration scheme will try to replace the
faulty PE's and keep the whole system operational. For a given FTPA with $(n \times n) + k$ PE's,
let $r_r$ be the reliability of each PE; then the overall reliability can be written as

$$R(n,k,r_n,r_r) = R_n(n,k,r_n) \sum_{i=0}^{k} s_i \binom{n^2+k}{i} r_r^{\,n^2+k-i} (1-r_r)^{i} \qquad (2.2)$$

where $s_i$ is the probability that the reconfiguration scheme succeeds when there are $i$ faulty
PE's in the array. Expressions (2.1) and (2.2) are valid for any type of FTPA. However, in
order to facilitate the presentation and discussion of the ideas and results, two assumptions
are introduced next and apply to the remainder of this paper. For a given FTPA, it is
assumed that (1) all PE's are redundant hardware and the remaining circuitry corresponds
to non-redundant hardware; (2) with the exception of the mechanisms and scheme used for
reconfiguration, all other procedures and hardware required for fault tolerance are perfect
(e.g., fault detection and location).
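
Expression (2.2) can be evaluated directly once the success probabilities are known. The sketch below assumes the usual binomial form, R = R_n * sum_{i=0..k} s_i * C(n^2+k, i) * r_r^(n^2+k-i) * (1-r_r)^i, in which no more than k faulty PE's can be tolerated; the function name and the sample values (including all s_i = 1) are hypothetical.

```python
from math import comb

def ftpa_reliability(n, k, r_r, R_n, s):
    """Overall FTPA reliability under the assumed binomial form of (2.2).

    n, k : array is (n x n) + k PE's, with k spares
    r_r  : reliability of a single PE
    R_n  : reliability of the non-redundant hardware
    s    : s[i] = probability that reconfiguration succeeds with i faulty PE's
           (s[0] is normally 1); terms beyond index k are ignored
    """
    total = n * n + k
    last = min(k, len(s) - 1)
    redundant = sum(
        s[i] * comb(total, i) * r_r ** (total - i) * (1.0 - r_r) ** i
        for i in range(last + 1)
    )
    return R_n * redundant

# Hypothetical 4 x 4 array with 4 spares and perfect reconfiguration.
print(ftpa_reliability(n=4, k=4, r_r=0.99, R_n=0.98, s=[1.0] * 5))
```

With k = 0 and s = [1.0] the function reduces to R_n * r_r^(n^2), the simplex case.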

In general, the lifetime Cumulative Distribution Function (CDF) of any FTPA component is
assumed to be exponential in operation time. The corresponding hazard function of
each component is then a constant 12. In a simplex processor array (an array without any
redundancy), a failure in any component causes the whole array to fail, and the hazard function of
the entire array, $h(t)$, is also a constant, i.e., $h(t) = \lambda > 0$. The corresponding reliability is

$$R(t) = e^{-\lambda t}.$$

In FTPA's, a faulty redundant hardware component (e.g., a PE) may be replaced if
there is a spare available in the array. At time $t = 0$, all components (including the
spare ones) are assumed to be good, the probability of a successful reconfiguration is high,
and the total number of mortal failures of the array per unit time is low. However, as time
goes on, i.e., when the value of $t$ becomes large, some of the spare components are exhausted.
The possibility of recovery from additional failures becomes smaller, i.e., the probability of
successful reconfiguration is lower, and the value of $h(t)$ increases. Eventually, when most
of the spare components are exhausted, the hazard function $h(t)$ tends to a constant,
which is the hazard rate of the array with its redundancy exhausted. The value of $h(t)$ depends
on many factors and, as for the reliability $R(t)$, it is very difficult to obtain an exact
expression for $h(t)$. However, the following simple properties of $h(t)$ can be inferred from the
discussion above: i) it should be a monotonic increasing function of time $t$ and tend to a
constant; ii) it is a continuous function and changes its value smoothly in the range $(0, +\infty)$. The
well-known Weibull estimation model 12 17-12 will be studied and used to approximate
$h(t)$ based on these properties. In the rest of this paper, $R^*(t)$ and $h^*(t)$ denote
approximations of $R(t)$ and $h(t)$, respectively.

The two-parameter Weibull estimation model has the form

$$h^*(t) = a\,t^{b}, \quad a, b \in \mathbb{R},\ a > 0 \text{ and } b > 0. \qquad (2.3)$$

The parameters $a$ and $b$ are called the scale parameter and shape parameter, respectively.
The corresponding formula for the reliability function is

$$R^*(t) = e^{-\int_0^t h^*(x)\,dx} = e^{-\frac{a}{b+1}\,t^{b+1}}. \qquad (2.4)$$

The expressions for both $h^*(t)$ and $R^*(t)$ are very simple compared to other possible
approximations, such as the Gamma, Beta and Lognormal distributions 12 l®. Moreover,
according to our experience and previous work 17-1S, the Weibull distribution model is very
flexible and capable of providing a rich family of approximations for increasing (and
decreasing) hazard functions $h(t)$ by adjusting the scale parameter $a$ and the shape parameter $b$.
Because, for FTPA's, $h(t)$ is a positive monotonic increasing function of $t$, parameters $a$ and
$b$ are always positive numbers.
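
To make the relation between the Weibull hazard (2.3) and the corresponding reliability concrete, the sketch below evaluates h*(t) = a t^b and R*(t) = exp(-a t^(b+1) / (b+1)), and checks the closed form against a numerical integration of the hazard. The function names and the parameter values are illustrative assumptions, not values from the paper.

```python
import math

def weibull_hazard(t, a, b):
    """h*(t) = a * t**b, with a > 0 and b > 0."""
    return a * t ** b

def weibull_reliability(t, a, b):
    """Closed form R*(t) = exp(-a * t**(b + 1) / (b + 1))."""
    return math.exp(-a * t ** (b + 1) / (b + 1))

def reliability_by_quadrature(t, a, b, steps=10_000):
    """Midpoint-rule evaluation of exp(-integral_0^t h*(x) dx)."""
    dx = t / steps
    integral = sum(weibull_hazard((i + 0.5) * dx, a, b) for i in range(steps)) * dx
    return math.exp(-integral)

# Hypothetical scale and shape parameters.
a, b, t = 2.0, 1.5, 0.8
print(weibull_reliability(t, a, b), reliability_by_quadrature(t, a, b))
```

The two printed values should agree to several decimal places, which is a quick sanity check when fitting a and b to simulated hazard data.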

3. GENERAL MODELS FOR FTPA RELIABILITY

The  reliability  of  FTPA’s  is  not  only  a  function  of  time  but  also  a  function  of  the  size 
of  FTPA’s.  A  generalized  reliability  evaluation  model  that  captures  this  dependence  on  size 
is  studied  in  this  section. 

For a simplex PA of size $N$ (i.e., $N$ = number of PE's), the hazard function, in terms of
the size, is a constant, i.e., $h(N) = a = \ln\frac{1}{r_r}$. The corresponding reliability is
$R(N) = r_r^{N} = e^{-a \cdot N}$. For FTPA's, instead of a constant, the hazard function is a monotonic
increasing function of $N$. This can be explained by the following two facts. First, the
redundancy factor $\eta = \frac{k}{N-k}$ decreases significantly as $N$ increases. This may cause a
shortage of spares to replace faulty PE's in a very large array and implies that the
probability of failure of the whole system increases with $N$. Second,
because the size of the non-redundant area $A_n$ is usually proportional to the size of the
array, the reliability of the non-redundant hardware decreases as $N$ increases.

As mentioned in Section 2, the Weibull model can provide a simple and good
approximation for non-linear monotonic increasing (or decreasing) hazard functions. So, $h(N)$
can be modeled by another Weibull formula, i.e.,

$$h^*(N) = a' \cdot N^{b'}, \quad a', b' \in \mathbb{R},\ a' > 0,\ b' > 0. \qquad (3.1)$$

Without loss of generality, let's assume that $N$ is a continuous variable; then the
corresponding reliability in terms of the size $N$ can be written as

$$R^*(N) = e^{-\int h^*(N)\,dN} = e^{-\alpha N^{\beta}} \qquad (3.2)$$

where $\alpha = \frac{a'}{b'+1}$ and $\beta = b'+1$.

There is no time variable $t$ in Equation 3.1, which implies that the time is fixed in the
equation, i.e., $t = t_0$. Assume that, for a given reconfiguration scheme, the parameter $b'$, which
describes the sensitivity of $h^*(N)$ to the size $N$, is independent of the time $t$; in other words,
the rate at which the hazard rate increases with $N$, for a given fault-tolerant
scheme, is assumed to be independent of time (parameter $a'$ should be a function of $t$). If parameter $b$ in
Equation 2.3 is independent of the size $N$, then the hazard rate can be approximated by a
model with two variables,

$$h^*(N,t) = c \cdot t^{b} \cdot N^{b'}, \quad c > 0,\ b > 0,\ b' > 0,\ c, b, b' \in \mathbb{R} \qquad (3.3)$$

where $c$ is a new parameter. (The error of the two-variable approximation model may
increase if parameters $b$ and $b'$ are not independent of $N$ and $t$, respectively.) The
corresponding two-variable model for the reliability function is

$$R^*(N,t) = e^{-\int\!\!\int h^*(N,t)\,dt\,dN} = e^{-\frac{c}{(b+1)(b'+1)}\,t^{b+1} N^{b'+1}} = e^{-\alpha\,t^{\beta} N^{\gamma}} \qquad (3.4)$$

where the parameters $\alpha = \frac{c}{(b+1)(b'+1)}$, $\beta = b+1$ and $\gamma = b'+1$ are positive real numbers.

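Besides the time $t$ and the size $N$, the reliability of an FTPA also strongly depends on $r_r$,
the reliability of a single PE, which is determined by the complexity of the PE, the technology, etc.
Because $r_r(t) = e^{-\lambda t}$, where $\lambda$ is the failure rate of a single PE, it follows that
$t = -\frac{1}{\lambda}\ln r_r(t)$. Let $\delta = \frac{1}{\lambda} > 0$. Substituting $t$ into Equation 3.4, the
reliability can then be written in terms of $N$ and $r_r$ as

$$R^*(N,r_r) = e^{-\alpha\,\delta^{\beta}\left(\ln\frac{1}{r_r}\right)^{\beta} N^{\gamma}}. \qquad (3.5)$$

Instead of being a function of time $t$, $r_r$ itself can be an independent variable. Letting
$\alpha' = \alpha \cdot \delta^{\beta}$, Equation 3.5 can be rewritten as

$$R^*(N,r_r) = e^{-\alpha'\left(\ln\frac{1}{r_r}\right)^{\beta} N^{\gamma}}. \qquad (3.6)$$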

In general, if the reliability of a unit of non-redundant area always has the same value as
the reliability of a single PE, i.e., $r_n = r_r$, then the reliability of the array can be described
by a single-Weibull distribution formula in terms of $N$ and $r_r$ as in Equation 3.6. However, in
some cases, e.g., in hierarchical fault-tolerant structures 14-18, $r_n$ and $r_r$ may vary
independently (details are shown in Section 4). Therefore, the reliabilities of the redundant hardware and
the non-redundant hardware should be described separately in terms of the variables $r_r$ and $r_n$.
Because the overall reliability of the system is the product of the reliability of the
non-redundant hardware and that of the redundant hardware, another model, the double-Weibull distribution
formula, is used to estimate the reliability,

$$R^*(N,r_n,r_r) = e^{-\alpha_1\left(\ln\frac{1}{r_n}\right)^{\beta_1} N^{\gamma_1}} \cdot e^{-\alpha_2\left(\ln\frac{1}{r_r}\right)^{\beta_2} N^{\gamma_2}} \qquad (3.7)$$

where $\alpha_i$, $\beta_i$ and $\gamma_i$ ($i = 1, 2$) are the six parameters that must be estimated for different
FTPA's. Equations 3.6 and 3.7 are used in the next section to study the different
reconfiguration schemes and to derive a methodology to find the optimal structure of a
bi-level FTPA.
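
The two approximation models of this section can be written as small functions. The sketch below assumes the single-Weibull form R* = exp(-alpha' * (ln(1/r))^beta * N^gamma) and takes the double-Weibull model to be the product of two such terms, one for the non-redundant and one for the redundant hardware; the function names and parameter values are illustrative and not fitted to any scheme.

```python
import math

def single_weibull(N, r, alpha, beta, gamma):
    """R*(N, r) = exp(-alpha * ln(1/r)**beta * N**gamma)."""
    return math.exp(-alpha * math.log(1.0 / r) ** beta * N ** gamma)

def double_weibull(N, r_n, r_r, params_n, params_r):
    """Product of two single-Weibull terms (six parameters in total).

    params_n and params_r are (alpha, beta, gamma) triples for the
    non-redundant and the redundant hardware, respectively.
    """
    return single_weibull(N, r_n, *params_n) * single_weibull(N, r_r, *params_r)

# With alpha = beta = gamma = 1 the single-Weibull form reduces to r**N,
# i.e., the simplex array of Section 3.
print(single_weibull(16, 0.99, 1.0, 1.0, 1.0), 0.99 ** 16)

# Hypothetical parameter triples for a redundant scheme.
print(double_weibull(16, 0.99, 0.99, (0.5, 1.0, 0.8), (0.1, 1.5, 1.2)))
```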


4.  THE  ANALYSIS  AND  DESIGN  OF  HIERARCHICAL  FTPA’S 

The basic idea behind the design of a hierarchical FTPA is best explained if the particular
case of a bi-level FTPA is considered first. A bi-level FTPA consists of a set of
fault-tolerant subarrays. In other words, the full array is partitioned into subarrays and can be
thought of as an array of subarrays. Both the subarrays and the array of subarrays use
some fault-tolerance scheme. The subarrays are hereafter denoted as 1st-level FTPA's and the
array of subarrays is referred to as the 2nd-level FTPA. A 2nd-level FTPA can be thought
of as an FTPA where the basic modules are themselves FTPA's and, physically, it is the
same as the bi-level FTPA. The extension to multi-level FTPA's can be easily made by
realising that an n-level FTPA consists of an FTPA whose basic modules are (n-1)-level FTPA's,
which are FTPA's composed of (n-2)-level FTPA's, etc. For convenience of presentation,
bi-level arrays are assumed hereafter and, unless stated otherwise, the basic ideas and results
apply to multi-level arrays as well.

When faults occur in a bi-level array, reconfiguration is first attempted at the 1st-level
FTPA's, and, when reconfiguration fails at the 1st-level, reconfiguration at the
2nd-level FTPA is attempted. The total non-redundant hardware in a bi-level structure consists
of the switches, links and control circuits at the 2nd-level. Therefore, bi-level FTPA's can be
expected to have better reliability than single-level FTPA's for at least two reasons: (1) the
area of non-redundant hardware in bi-level FTPA's can be smaller than that in single-level
FTPA's and (2) the size of arrays and the reconfiguration approach used at each level can be
chosen so that optimal reliability results, thus avoiding the inevitable reliability degradation
that occurs when the size of single-level arrays grows too large.

Having realized the potential benefits of multi-level FTPA's, from an engineering point
of view, it is essential to have a systematic and formal approach to the design of these
systems so as to optimize the reliability of the overall FTPA. This approach is described in the
remainder of this paper.

4.1  The  optimisation  of  hierarchical  structure 

Without loss of generality, let the sizes of the 1st-level and 2nd-level FTPA's be
$(n_1 \times n_1) + k_1$ and $(n_2 \times n_2) + k_2$, respectively. Here, $k_1$ and $k_2$ correspond to the number of
spare PE's in a 1st-level FTPA and the number of spare subarrays in the 2nd-level, respectively. Also,
the number of processors in the bi-level array is $(n \times n) + k$, where $n = n_1 \times n_2$ and
$k = k_1 n_2^2 + k_2 (n_1^2 + k_1)$. The reliability of the bi-level FTPA is essentially the reliability
of the 2nd-level FTPA, i.e., $R_2 = R_2(n_2, k_2, R_1)$, where $R_1$ is the reliability of one 1st-level
FTPA, i.e., $R_1 = R_1(n_1, k_1, r_r)$. Assuming that $k_1$ and $k_2$ can be expressed as functions of
$n_1$ and $n_2$ for the given reconfiguration schemes, $R_1$ and $R_2$ can be simplified as
$R_1 = R_1(n_1, r_r)$ and $R_2 = R_2(n_2, R_1)$. For a given array size $n$, different values of $n_2$ significantly
affect both $R_1$ and $R_2$. A small value of $n_1$ may yield a higher value of $R_1$ but does not
necessarily provide a high value of $R_2$ because of the increase of $n_2$ ($n_2 = n/n_1$). An example
of the reliability of a bi-level FTPA against the number of subarrays $n_2$ is shown in
Figure 2. The value of $n_2$ for which $R_2$ is optimal is the solution of the equation

$$\frac{dR_2}{dn_2} = \frac{\partial R_2}{\partial n_2} + \frac{\partial R_2}{\partial R_1} \cdot \frac{dR_1}{dn_2} + \frac{\partial R_2}{\partial k_2} \cdot \frac{dk_2}{dn_2} = 0. \qquad (4.1)$$

Once the solution $n_2$ is obtained, $n_1$ can be found because $n$ is assumed to be known.
Unfortunately, for a bi-level FTPA design, Equation 4.1 is hard to obtain and solve. In order to
deal with this problem, a new methodology based on the general single- and double-Weibull
approximation models discussed in Section 3 is proposed, in which individual models are used
to estimate the reliabilities of the first and the second levels.
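
The bookkeeping for the size of a bi-level array can be checked with a few lines of code. The sketch below counts the processors in two ways, as the number of subarrays times the PE's per subarray and as n^2 + k with k = k_1 n_2^2 + k_2 (n_1^2 + k_1), and verifies that both agree; the function name is hypothetical.

```python
def bilevel_counts(n1, k1, n2, k2):
    """Return (n, k, total PE's) for a bi-level FTPA.

    1st level: each subarray has (n1 x n1) + k1 PE's.
    2nd level: the array has (n2 x n2) + k2 subarrays.
    """
    n = n1 * n2
    k = k1 * n2 ** 2 + k2 * (n1 ** 2 + k1)
    total = (n2 ** 2 + k2) * (n1 ** 2 + k1)
    assert total == n * n + k          # both ways of counting must agree
    return n, k, total

# One spare column at each level (k1 = n1, k2 = n2) for n = 36.
print(bilevel_counts(n1=6, k1=6, n2=6, k2=6))   # (36, 468, 1764)
```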


Without loss of generality, assume that the area of hardware is normalized so that a
single PE always occupies one unit of chip area. Let the reliability of hardware in any unit
of area have the same value regardless of the kind of circuitry in it. For the reconfiguration
schemes used at the first level, because the reliability of a basic PE is given as a constant,
the only variable for $R_1$ is the size of the subarrays, $n_1$. The single-Weibull approximation
model can be used to estimate $R_1(n_1)$ as

$$R_1^*(n_1) = e^{-\alpha' \cdot n_1^{\beta}} \qquad (4.2)$$

where $\alpha'$ and $\beta$ are real positive constants.

The approximation form $R_2^*$ of the reliability for the second level as a function of $n_2$ is
more involved than $R_1^*$. This is because the reliability of one module (subarray) in the 2nd-level
is $R_1(n_1)$, which is a function of $n_1$ (it is also a function of $n_2$ for a fixed $n$ because
$n_1 = n/n_2$) instead of a constant (as $r_r$ is in the 1st-level). However, the reliability of a unit of
non-redundant area, $r_n$, is still a constant. The reliabilities of the redundant and non-redundant
areas, $R_r$ and $R_n$, in the 2nd-level will not always vary congruously with $n_2$. For example,
sometimes a decrease of $R_n$ due to a larger value of $n_2$ may cause $R_r$ to increase. Because the
goal here is to find the trade-off between $n_2$ and $n_1$ for which the overall reliability is
optimized, the single-Weibull model can no longer be used, and the reliabilities $R_r$ and $R_n$
should be estimated separately in order to represent their variations with $n_2$. According to
the general expression of the reliability in Equation 2.1, the reliability in the 2nd-level for some
reconfiguration scheme can be written as

$$R_2(n_2, r_n, R_1(n_2)) = R_n(n_2, r_n) \cdot R_r(n_2, R_1(n_2)) \qquad (4.3)$$

where

$$R_n(n_2, r_n) = r_n^{A_n(n_2)}. \qquad (4.4)$$

Because $r_n$ is a constant, $R_2(n_2, r_n, R_1(n_2))$ can be written in terms of $n_2$ and $R_1$, with

$$R_r(n_2, R_1(n_2)) = \sum_{i=0}^{k_2} s_i \binom{n_2^2+k_2}{i} R_1^{\,n_2^2+k_2-i} (1 - R_1)^{i}. \qquad (4.5)$$

$A_n$ is the non-redundant area in the 2nd-level and $k_2$ is the number of spare subarrays, which is
usually a known function of $n_2$ for a given reconfiguration scheme. The double-Weibull
approximation model should then be used to express the reliability of the reconfiguration
schemes used in the 2nd-level of the FTPA. Because $r_n$ is a constant, $R_n$ can be modeled as

$$R_n^*(n_2) = e^{-a \cdot n_2^{b}} \qquad (4.6)$$

where $a$ and $b$ are positive real parameters. From (4.4) and (4.6), it is clear that
$A_n(n_2)$ is approximated by $-\frac{a}{\ln r_n} \cdot n_2^{b}$, i.e., the non-redundant area is assumed to grow
proportionally with some power of the number of processors in the array.

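The approximation form used for $R_r(n_2, R_1)$ has two variables, $n_2$ and $R_1$:

$$R_r^*(n_2, R_1) = e^{-v\left(\ln\frac{1}{R_1}\right)^{u} n_2^{w}} \qquad (4.7)$$

where $u$, $v$ and $w$ are positive real constants. In summary, the double-Weibull approximation
form $R_2^*$ is

$$R_2^* = e^{-\left[a\,n_2^{b} + v\left(\ln\frac{1}{R_1}\right)^{u} n_2^{w}\right]}. \qquad (4.8)$$

Because $n_1 = n/n_2$, letting $\alpha = \alpha' \cdot n^{\beta}$, (4.2) can be rewritten as

$$R_1^* = e^{-\alpha\,n_2^{-\beta}}. \qquad (4.9)$$

Substituting $R_1^*$ into (4.8) yields

$$R_2^* = e^{-\left(a\,n_2^{b} + \alpha^{u} v\,n_2^{(w-\beta u)}\right)}. \qquad (4.10)$$

Equation (4.1) can now be solved, i.e.,

$$\frac{dR_2^*}{dn_2} = -\left[a\,b\,n_2^{b-1} + \alpha^{u} v\,(w - \beta u)\,n_2^{(w-\beta u-1)}\right] e^{-\left(a\,n_2^{b} + \alpha^{u} v\,n_2^{(w-\beta u)}\right)} = 0. \qquad (4.11)$$

Because $R_2^* \neq 0$, $\frac{dR_2^*}{dn_2} = 0$ if and only if

$$a\,b\,n_2^{b-1} + \alpha^{u} v\,(w - \beta u)\,n_2^{(w-\beta u-1)} = 0. \qquad (4.12)$$

Assuming also that $n_2 \neq 0$, (4.12) can be simplified as

$$a\,b\,n_2^{b} + \alpha^{u} v\,(w - \beta u)\,n_2^{(w-\beta u)} = 0. \qquad (4.13)$$

Because all parameters are positive real numbers, the necessary and sufficient condition for
the existence of a real positive solution to (4.12) is

$$w - \beta u < 0. \qquad (4.14)$$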


There are many fault-tolerant schemes (with different values of the parameters
$u$, $w$ and $\beta$) which can be used to construct the bi-level structure. If (4.14) is not satisfied
for two chosen schemes, then Equation 4.13 has no real solution and $R_2^*$ attains its
maximum at one of the boundary points, i.e., $n_2 = 1$ or $n_2 = n \times n$. This means that the bi-level
structure cannot provide better reliability with these two schemes than a single-level
structure with one of these schemes. If this happens, other schemes should be considered and
checked against (4.14). If (4.14) holds for two specific schemes, Equation (4.13) can be rewritten
as

$$a\,b\,n_2^{(b - w + \beta u)} + \alpha^{u} v\,(w - \beta u) = 0. \qquad (4.15)$$

Letting $\theta = w - \beta u$ and $\phi = -\alpha^{u} v\,\theta / (a\,b)$, the optimal solution is

$$n_2^* = \phi^{1/(b-\theta)}. \qquad (4.16)$$

Substituting (4.16) in (4.10) yields the maximum reliability attainable with a bi-level FTPA
using two given fault-tolerance schemes,

$$R_2^* = e^{-\left[a\,\phi^{b/(b-\theta)} + \alpha^{u} v\,\phi^{\theta/(b-\theta)}\right]}. \qquad (4.17)$$

In summary, the optimal hierarchical structure can be obtained by the following four steps: 1) Obtain the parameters of the Weibull distribution model (via simulation or estimation) for all fault-tolerant schemes that may be used in the hierarchical structure. 2) Choose different combinations of the schemes for the 1st and 2nd levels and check condition (4.14) to find the possible candidates. 3) Substitute the parameters of the candidate combinations into (4.17); the combination which maximises $R_2$ is the optimal combination of schemes. 4) Substitute the chosen parameters into (4.16) to obtain the optimal partition $n_2$ of the hierarchical structure.
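The four steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it takes (4.10) to be $R_2 = \exp[-(a n_2^{b} + \alpha^{u} v\, n_2^{w-\beta u})]$, (4.14) to be $w - \beta u < 0$, and (4.16) to be $n_2^{*} = \phi^{1/(b-\theta)}$ with $\theta = w - \beta u$ and $\phi = -\alpha^{u} v \theta/(ab)$. The function names and the parameter values are hypothetical and are not taken from the paper.

```python
import math


def r2(n2, a, b, alpha, beta, u, v, w):
    """Bi-level reliability R2(n2) in the assumed double-Weibull form (4.10)."""
    theta = w - beta * u
    return math.exp(-(a * n2 ** b + alpha ** u * v * n2 ** theta))


def optimal_partition(a, b, alpha, beta, u, v, w):
    """Return (n2_opt, r2_opt), or None when condition (4.14) is violated."""
    theta = w - beta * u
    if theta >= 0:
        # (4.14) fails: no interior optimum, the maximum is at a boundary point.
        return None
    phi = -(alpha ** u) * v * theta / (a * b)
    n2_opt = phi ** (1.0 / (b - theta))                      # assumed form of (4.16)
    return n2_opt, r2(n2_opt, a, b, alpha, beta, u, v, w)    # value of (4.17)


# Hypothetical parameter values, chosen only to exercise the formulas.
params = dict(a=0.027, b=1.28, alpha=14.88, beta=0.887, u=2.0, v=0.01, w=0.5)
result = optimal_partition(**params)
if result is not None:
    print("n2* = %.3f, R2* = %.3f" % result)
```

In practice the real-valued $n_2^{*}$ would still have to be rounded to an integer that divides the array dimension, as the example in Section 4.2 does.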


4.2  An  illustrative  example 


As an example, the design of a bi-level FTPA with $n = 36$ is considered where the simple CR approach is used in the 1st level and the DI scheme is used in the 2nd level. For both CR and DI schemes a column of spare processors is used. The area and reliability estimations were computed for both the CR and the DI methods and the reliabilities for each of the levels were approximated as

$$R_1 \approx e^{-14.88\, n_2^{-0.887}} \qquad (4.18)$$


—[0.027 H.J-28  +  0.84  {ln-^-)*-T4a  h^73®) 

R2  S£  e  '  1  '  (4.19) 

Assume that $r_T = r_H = 0.99$. The values of the variables in (4.16) are $\theta \approx -11.24$, $b \approx 1.28$ and $\phi \approx 1.144 \times 10^{9}$, and the optimal value of $n_2$ is $n_2^{*} \approx 5.3$. Because both $n_1$ and $n_2$ must be integers with $n_1 \times n_2 = 36$, $n_2 = 6$ is used. There are then a total of 1764 PEs in the array and the redundancy factor is 1.36. The overall reliability is 0.75, whereas the reliabilities of a single-level array using the DI approach or the CR approach with the
same redundancy factor are 0.08 and 0.31, respectively.
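The arithmetic of this example can be checked with a short sketch. It assumes that (4.16) has the form $n_2^{*} = \phi^{1/(b-\theta)}$ and that $\phi$ reads $1.144 \times 10^{9}$; the PE count additionally assumes $n_2 \times n_2$ subarrays of $n_1 \times n_1$ PEs with one spare column added at each level, which is an interpretation rather than something stated explicitly in the text.

```python
# Values quoted in the example for (4.16).
theta = -11.24
b = 1.28
phi = 1.144e9

n2_opt = phi ** (1.0 / (b - theta))   # assumed form of (4.16)

# Assumed layout: one spare column per subarray (CR) and one spare
# column of subarrays at the second level (DI).
n = 36
n2 = 6
n1 = n // n2
pes_per_subarray = n1 * (n1 + 1)
subarrays = n2 * (n2 + 1)
total_pes = pes_per_subarray * subarrays
redundancy = total_pes / (n * n)

print(round(n2_opt, 1), total_pes, round(redundancy, 2))
```

Under these assumptions the sketch gives the quoted optimum of about 5.3, the 1764 PEs, and the redundancy factor of 1.36.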


5.  CONCLUSIONS 

Several important related conclusions can be drawn from the work reported in this paper. First, according to the properties of the hazard rate of the FTPAs, the Weibull estimation model can give a good evaluation of the reliability of different FTPAs. The second conclusion is that, based on reliability estimations, the non-redundant hardware needed for fault-tolerance purposes does limit the usefulness of single-level FTPAs beyond a certain size. For different fault-tolerant schemes, different array sizes and different area and technology parameters, there does not exist a scheme for FTPAs which is universally optimal. In other words, different fault-tolerant schemes are optimal for different array sizes and different VLSI technologies. The third conclusion is that hierarchical FTPAs do not suffer from the disadvantage pointed out in the second conclusion for single-level FTPAs and can be made highly reliable.

To achieve the third conclusion mentioned above, the problem of designing optimal bi-level FTPAs was addressed and a methodology for its solution has been described. The key to this methodology is to avoid the complexity of the exact analytical expressions by using accurate functional approximations of the reliability of FTPAs at different levels. These approximations are based on Weibull reliability functions.

6. REFERENCES

[1]  C.  Mead  and  L.  Conway,  "Introduction  to  VLSI  Systems,"  Addison- Wesley,  Reading, 
Mass.,  1980. 

[2]  J.  A.  Abraham,  P.  Banerjee,  C-Y.  Chen,  W.  K.  Fuchs,  S-Y.  Kuo  and  A.  L.  N.  Reddy, 
"Fault  Tolerance  Techniques  for  Systolic  Arrays,"  Computer,  July  1987,  pp.  65-74. 

[3]  I.  Koren  and  M.  A.  Breuer,  "On  Area  and  Yield  Considerations  for  Fault-Tolerant  Pro¬ 
cessor  Arrays,"  IEEE  Trans,  on  Computers,  Jan.  1984,  pp.  21-27. 

[4]  S.  Y.  Kung,  "VLSI  Array  Processors,"  Prentice-Hall,  Englewood  Cliffs,  N.J.  1987. 

[5]  A.  L.  Rosenberg,  "The  Diogenes  Approach  to  Testable  Fault  Tolerant  Arrays  of  Proces¬ 
sors,"  IEEE  Trans,  on  Computers,  Oct.  1983,  pp.  902-910. 

[6] J. A. B. Fortes and C. S. Raghavendra, "Gracefully Degradable Processor Arrays," IEEE Trans. on Computers, Nov. 1985, pp. 1033-1044.

[7] M. Sami and R. Stefanelli, "Fault-Tolerance and Functional Reconfiguration in VLSI Arrays," International Conference on Circuits and Systems, 1986.

[8] H. T. Kung and M. S. Lam, "Fault-Tolerance and Two-Level Pipelining in VLSI Systolic Arrays," Proc. Conf. Advanced Research in VLSI, Jan. 1984, pp. 74-83.

[9] J-H. Kim and S. M. Reddy, "A Fault-Tolerant Systolic Array Design Using TMR Method," Proc. IEEE Int'l Conf. Comp. Design: VLSI in Computers, Oct. 1985, pp. 789-773.

[10] J. Von Neumann, "Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components," in Automata Studies, C. E. Shannon and J. McCarthy (Eds.), Princeton University Press, Princeton, New Jersey, 1956.

[11] I. Koren, "Comments on 'The Diogenes Approach to Testable Fault-Tolerant Arrays of Processors,'" IEEE Trans. on Comp., Vol. C-35, No. 1, Jan. 1986, p. 93.

[12] D. P. Siewiorek and R. S. Swarz, "The Theory and Practice of Reliable System Design," Digital Press, Educational Services, DEC, Bedford, Mass., 1982.

[13] Y. Wang and J. A. B. Fortes, "Hierarchical Approaches to Fault-Tolerance in Processor Arrays," SPIE's 1987 Tech. Symp. on Optics, Electro-Optics, and Sensors, 1987.

[14]  T.  Ishikawa,  S.  Momoi,  S.  Shimada  and  Y.  Ogawa,  "Hierarchical  Array  Processor 
(HAP)  Featuring  High  Reliability  and  High  System  Performance,"  1986  Int’l.  Conf.  on 
Parallel  Processing 

[15]  M.  Wang,  M.  Cutler  and  S.Y.H.  Su,  "On-Line  Error  Detection  and  Reconfiguration  of 
Array  Processors  with  Two-Level  Redundancy,"  Proc.  Comp,  on  VLSI  and  Computers, 
May,  1987. 

[16]  N.  Tsuda  and  T.  Satoh,  "Hierarchical  Redundancy  for  a  Linear-Array  Sorting  Chip," 
I.F.I.P.  international  workshop  on  Wafer-Scale  Integration,  23-25  September  1987. 

[17]  A.  Adatia  and  L.K.  Chan,  "Robust  Estimators  of  the  3-Parameter  Weibull  Distribu¬ 
tion,”  IEEE  Trans.  Reliability,  October  1985. 

[18]  H.L.  Harter,  A.H.  Moore  and  R.P.  Wiegand,  "Sequential  Tests  of  Hypotheses  for  Sys¬ 
tem  Reliability  Modeled  by  a  2-Parameter  Weibull  Distribution,"  IEEE  Trans.  Reliabil¬ 
ity,  October  1985. 

[19]  J.B.  McDonald  and  D.O.  Richards,  "Hazard  Rates  and  Generalized  Beta  Distributions," 
IEEE  Trans.  Reliability,  October  1987. 

[20]  K.S.  Hedlund  and  L.Snyder,  "Systolic  Architectures-A  Wafer  Scale  Approach,”  IEEE 
1984  Int’l.  Conf.  on  Comp.  Design:  VLSI  in  Computers,  1984,  pp.  604-610. 

[21]  S.Y.  Kung,  C.W.  Chang  and  C.W.  Jen,  "On  Fault- Tolerance  in  Array  Processors," 
IEEE  1985  Int’l.  Conf.  on  Comp.  Design,  pp.764-768  1985. 

[22]  C.  Jesshope  and  L.  Bentley,  "Techniques  for  Implementing  two-dimensional  wafer-scale 
processor  arrays,"  IEE  Proceedings,  Vol.  134,  Pt.E,  No.2,  March  1987,  pp.  87-92. 

[23]  J.H.  Hwang  and  C.S.Raghavendra,  "VLSI  Implementation  of  Fault-Tolerant  Systolic 
Arrays,"  Proc.  IEEE  Int’l.  Conf.  on  Comp.  Design:  VLSI  in  Computers,  1986,  pp.  110- 
113. 


Figure 1. The reliability of different single-level FTPA schemes with $(n \times n) + k$ processors; different approaches are optimal for different values of $n$, and low reliability results for large arrays.


Figure 2. The reliability of a bi-level FTPA, $R_2$, as a function of the number of subarrays $n_2$. The CR scheme is used in the low level and the DI scheme is used in the high level, with $n = 36$. The results shown as point markers are obtained by the simulation program and the solid line is according to the approximation model, i.e., Equation 4.10.


REFERENCE  NO.  13 


Wang, Y.-X. and Fortes, J.A.B., "Estimates of MTTF and Optimal Number of Spares of Fault-Tolerant Processor Arrays," 20th Int'l Symposium on Fault-Tolerant Computing, June 26-28, 1990, Newcastle Upon Tyne, U.K., pp. 184-191.

Note - This paper presents an efficient analytical approach to the evaluation of Mean-Time-To-Failure (MTTF) of fault-tolerant arrays. It also proposes a design procedure that optimizes MTTF.


ESTIMATES  OF  MTTF  AND  OPTIMAL  NUMBER  OF  SPARES 
OF FAULT-TOLERANT PROCESSOR ARRAYS


Y.X.  Wang  and  Jose  A.B.  Fortes 
School  of  Electrical  Engineering 
Purdue  University 
West  Lafayette,  IN  47907 
fortes@ee.ecn.purdue.edu


ABSTRACT 

An  important  problem  in  the  design  of  Fault-Tolerant 
Processor Arrays (FTPAs) involves determining the amount of
redundancy  needed  to  meet  requirements  of  the  Mean  Time  To 
Failure  (MTTF)  or  to  optimise  the  MTTF.  Given  a  desired 
value  of  MTTF  and  an  FTPA  architecture,  the  necessary 
number  of  spares  (NNS)  is  defined  as  the  minimum  number  of 
spares  needed  to  achieve  the  desired  MTTF.  The  optimal 
number  of  spares  (ONS)  is  defined  as  the  number  of  spares  for 
which the MTTF of the FTPA is maximised. This paper introduces reliability and MTTF models of different FTPAs. Based
on  these  models,  approaches  which  allow  for  the  analytical  esti¬ 
mate  of  the  NNS  and  the  ONS  are  proposed.  Knowledge  of 
NNS  is  suited  for  FTPAs  where  non-redundant  hardware 
(hardware  for  which  no  redundancy  is  provided)  is  considered 
nearly  fault-free.  Knowledge  of  ONS  is  useful  when  faults  can 
affect  the  non-redundant  hardware,  because  in  this  case  overall 
array  reliability  may  actually  decrease  when  the  number  of 
spares  increases  beyond  some  value.  The  quick  estimates  pro¬ 
vided  in  this  paper  can  be  used  to  help  designers  in  the  early 
phases  of  design  of  an  FTPA. 


1.  Introduction 

For processor arrays of large size, there is a high probability of failure of one or more processors. This implies that the value of mean time to failure (MTTF) may be very small for these processor arrays. In a fault-tolerant processor array (FTPA), spare processors are added and a fault-tolerant scheme (or reconfiguration scheme) is used to reconfigure the interconnections in order to replace faulty processors when faults occur. Besides the spare processors, some extra hardware such as interconnections, switches and control circuits is added to support the reconfiguration. Extensive work has been done towards designing FTPAs and typical schemes such as Column Redundancy (CR) [5], Diogenes (DI) [1], Complex Fault Stealing (CFS), and FUSS [4] are of particular interest in this paper.

In order to design FTPAs, it is necessary to choose one of many possible reconfiguration schemes and to decide how many spares are needed to achieve either maximum MTTF or a prescribed value of it. Detailed design and MTTF evaluation of all possible FTPA architectures are too time-consuming and cannot be used to help a designer make the above decisions. Instead, it is desirable to use valid estimates of MTTF and the number of spares that can be quickly computed for different reconfiguration schemes. Based on these estimates, a designer can choose an initial FTPA architecture that can then be designed and evaluated in detail in order to confirm and refine the estimates. This paper addresses the important questions of how much redundancy is necessary to achieve a required MTTF and what is the optimal redundancy amount that maximises MTTF.

This research was supported in part by the Innovative Science and Technology Office of the Strategic Defense Initiative Organisation and was administered through the Office of Naval Research under contracts No. 00014-85-k-0518 and No. 00014-88-k-0723.

The  cost  of  an  FTPA  increases  with  the  number  of  spares 
and  the  extra  hardware  needed  for  reconfiguration.  Given  a 
desired  value  of  MTTF  and  an  FTPA  architecture,  the  neces¬ 
sary  number  of  spares  (NNS)  is  defined  as  the  minimum 
number of spares needed to achieve that MTTF. If, for a
prescribed  MTTF,  the  number  of  spares  is  much  greater  than 
the  necessary  number  of  spares,  the  spares  in  excess  will  render 
the  FTPA  unnecessarily  expensive.  Moreover,  because  the 
amount  of  extra  hardware  increases  with  the  number  of  spares, 
it  is  not  guaranteed  that  the  additional  redundancy  always 
improves  the  MTTF  of  an  FTPA.  The  decrease  of  the  reliabil¬ 
ity  of  the  extra  hardware  may  reduce  the  MTTF  significantly, 
particularly  when  the  size  of  the  array  is  very  large.  Thus,  it  is 
not  always  true  that  the  larger  the  number  of  spares  the  higher 
the  MTTF  of  the  array.  The  optimal  number  of  spares  (ONS) 
is defined as the number of spares for which the MTTF of the
FTPA  is  maximized.  Different  reliability  models  and  MTTF 
models  of  FTPAs  in  terms  of  the  number  of  spares,  survivabil¬ 
ity  of  the  reconfiguration  scheme,  reliability  of  a  single  proces¬ 
sor  and  other  parameters  are  proposed  in  this  paper  in  order  to 
analyze  and  find  the  NNS  and  ONS.  The  results  obtained  by 
the  proposed  estimation  models  are  quite  accurate  when  com¬ 
pared  with  exact  values  obtained  by  extensive  computer  simu¬ 
lations. 

The  paper  is  organized  as  follows.  Basic  models  of  FTPAs 
and faults are introduced in Section 2. In Section 3, MTTF evaluation models are discussed and the effects of hardware redundancy on MTTF and reliability are compared. The estimation of the necessary number of spares is presented in Section 4. Section 5 discusses and proposes an MTTF model for the FTPAs where failures of interconnections, links, logic control circuits and other hardware elements are taken into consideration. A procedure for estimating the optimal number of
spares  is  proposed  in  Section  6.  Section  7  concludes  the  work 
reported  in  this  paper. 

2.  Reliability  Models  for  FTPAs 

For the purposes of this paper, an FTPA with $N^2 + k$ processors is modeled as $N^2 + k$ identical modules, each of which consists of two parts. One part contains hardware for which redundancy is provided; typically it corresponds to a processing element but, in general, it may also contain part of the hardware used for control, communication and reconfiguration. With this understanding, this part of the module is herein referred to as the processing element (PE). The other part of each module contains hardware for which no redundancy is provided. Typically, it includes switches, links and control circuitry for communication and reconfiguration purposes. In this paper, the hardware of the modules in the array without redundancy is collectively referred to as the non-redundant hardware, and processing elements are called redundant hardware because $k$ spare PEs are provided.

CH2877-9/90/0000/0184$01.00 © 1990 IEEE

In this paper, a failure of any component is assumed to cause the failure of the entire processor array unless it can be replaced by a functional spare. So, two conditions must be satisfied for an FTPA to be successfully reconfigured when faults occur. First, the reconfiguration scheme must succeed when the given number and pattern of faults are present. The conditional probability of a scheme being able to reconfigure the array given that there are $i$ faulty PEs in the array is called the probability of survival or survivability and is denoted $s_i$. Second, the mechanisms that support reconfiguration must work properly, i.e., the non-redundant hardware must be fault-free.

Faults in an FTPA are assumed to be uniformly distributed and independent; in other words, the failure of a given PE never causes failure of other PEs or non-redundant hardware components and vice versa. Furthermore, it is assumed that failure of a non-redundant component results in failure of the entire processor array. Then the overall reliability of an array can be modeled as

$$R = R_r \cdot R_n \qquad (1)$$

where $R_r$ is the reliability of the redundant hardware and $R_n$ is the reliability of the non-redundant hardware.

Without loss of generality, let a unit area of the hardware be defined as the area occupied by a single PE. The average area of non-redundant hardware in each module is assumed to be $a_n$ and, therefore, the total area of the non-redundant hardware of the FTPA is $A_n = (N^2 + k)\, a_n$. Also let the reliability of the hardware in a unit area be $e^{-\lambda t}$, where $\lambda$ is the failure rate of the hardware of a unit area and $t$ is the operation time variable. Then the reliability of the non-redundant hardware is $R_n(t) = e^{-\lambda A_n t}$. If the operation time $t$ is a constant that is normalized to unity, $R_n$ can also be written as $R_n = e^{-\lambda A_n}$.

The reliability of the redundant hardware $R_r(t)$ is defined as the summation of the probabilities that the reconfiguration scheme survives $i$ failures for all possible fault distributions and for $i = 0, 1, \ldots, k$, where $k$ is the number of spare PEs. Then,

$$R_r(t) = \sum_{i=0}^{k} s_i \binom{N^2+k}{i} e^{-(N^2+k-i)\lambda t} \left(1 - e^{-\lambda t}\right)^{i}.$$

The combination term in the above equation accounts for all possible fault distributions with $i$ faults. The overall reliability of an FTPA with $N^2 + k$ PEs can be modeled as

$$R(t) = R_n(t) \sum_{i=0}^{k} s_i \binom{N^2+k}{i} e^{-(N^2+k-i)\lambda t} \left(1 - e^{-\lambda t}\right)^{i}. \qquad (2)$$

Notice that not every failure of the redundant hardware causes the whole array to fail because the faulty redundant hardware may be replaced by spares. The fault which causes the whole array to fail is called the fatal fault (which may occur after all spares are exhausted).
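The model in (2) is straightforward to evaluate numerically. The sketch below assumes the form $R(t) = e^{-\lambda A_n t} \sum_{i=0}^{k} s_i \binom{N^2+k}{i} e^{-(N^2+k-i)\lambda t}(1-e^{-\lambda t})^{i}$ with $A_n = (N^2+k)\,a_n$; the function name `reliability` and its argument names are hypothetical.

```python
import math


def reliability(n_logic, k, lam, t=1.0, s=None, a_n=0.0):
    """Evaluate R(t) of Equation (2) for n_logic (= N^2) PEs plus k spares.

    s is the list of survivabilities s_0..s_k (all 1.0 if omitted) and
    a_n is the non-redundant area per module (0.0 means R_n = 1).
    """
    total = n_logic + k
    if s is None:
        s = [1.0] * (k + 1)
    r = math.exp(-lam * t)          # reliability of a single PE
    q = 1.0 - r
    r_red = sum(
        s[i] * math.comb(total, i) * r ** (total - i) * q ** i
        for i in range(k + 1)
    )
    r_non = math.exp(-lam * total * a_n * t)   # R_n(t) with A_n = total * a_n
    return r_non * r_red


print(reliability(100, 0, 1e-4), reliability(100, 1, 1e-4))
```

Setting `a_n=0.0` corresponds to the fault-free non-redundant hardware assumption, and passing a constant list for `s` corresponds to schemes with constant survivability.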

For some reconfiguration schemes, such as the Diogenes [1] and FUSS [4] schemes, the survivability $s_i$ is almost constant and close to one. Let $S$ denote the constant value of survivability $s_i$; then the reliability of this kind of FTPA can be modeled as

$$R(t) = e^{-\lambda A_n t}\, S \sum_{i=0}^{k} \binom{N^2+k}{i} e^{-(N^2+k-i)\lambda t} \left(1 - e^{-\lambda t}\right)^{i}. \qquad (3)$$

However, for some other reconfiguration schemes, such as the CFS scheme and the CR scheme [2] [5], the survivability is not a constant but a variable in the range $0 < s_i < 1$. Because there is no general closed form to express the survivability for this kind of scheme, computer simulations can be used to obtain the survivability $s_i$ [2]. Due to article size limitations, this paper only considers schemes for which the survivability $s_i$ can be approximated by a constant $S$. The discussion of the cases where the survivability $s_i$ is not constant can be found in [10].

In some previously proposed FTPA designs, it is assumed that the non-redundant hardware is perfect and never fails. This assumption is based on the fact that these non-redundant hardware components are, compared with redundant hardware, smaller and simpler and therefore can be designed carefully and conservatively so as to minimize the probability of operational failure. Therefore, when the amount of the non-redundant hardware is small, the non-redundant hardware can be assumed to be almost fault-free, or $R_n(t) \approx 1$ [1] [2]. Then, the overall reliability of an FTPA can be written as

$$R(t) = S \sum_{i=0}^{k} \binom{N^2+k}{i} e^{-(N^2+k-i)\lambda t} \left(1 - e^{-\lambda t}\right)^{i}. \qquad (4)$$

For large enough processor arrays, non-redundant hardware cannot be assumed to be fault-free. The general reliability model of the FTPAs in which $R_n < 1$ is studied later in Section 5.

3.  MTTF  Model  of  Processor  Arrays 
The MTTF should not be confused with the reliability function $R(t)$. As shown in Section 2, the reliability function $R(t)$ is the probability that the array has no fatal faults during the time period $(0, t]$ and is a function of operation time $t$. However, the MTTF is independent of the operation time $t$ and can be written as the integral of the reliability of the array [3]:

$$\text{MTTF} = \int_{0}^{\infty} R(t)\, dt. \qquad (5)$$


3.1.  Harmonic  series  model 

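For FTPAs where the reliability can be modeled by (4), according to (5), the MTTF can be written as

$$\text{MTTF} = S \sum_{i=0}^{k} \binom{N^2+k}{i} \int_{0}^{\infty} e^{-(N^2+k-i)\lambda t} \left(1 - e^{-\lambda t}\right)^{i} dt$$

$$= S \sum_{i=0}^{k} \binom{N^2+k}{i} \int_{0}^{\infty} e^{-(N^2+k-i)\lambda t} \sum_{j=0}^{i} \binom{i}{j} (-1)^{j} e^{-j\lambda t}\, dt$$

$$= S \sum_{i=0}^{k} \binom{N^2+k}{i} \sum_{j=0}^{i} \binom{i}{j} (-1)^{j} \int_{0}^{\infty} e^{-(N^2+k-i+j)\lambda t}\, dt$$

$$= \frac{S}{\lambda} \sum_{i=0}^{k} \binom{N^2+k}{i} \sum_{j=0}^{i} \binom{i}{j} (-1)^{j} \frac{1}{N^2+k-i+j}.$$

According to [8], the inner summation in the above equation can be simplified as

$$\sum_{j=0}^{i} \binom{i}{j} (-1)^{j} \frac{1}{N^2+k-i+j} = \frac{i!}{(N^2+k-i)(N^2+k-i+1)\cdots(N^2+k)}.$$

So, the MTTF can be written as

$$\text{MTTF} = \frac{S}{\lambda} \sum_{i=0}^{k} \binom{N^2+k}{i} \frac{i!}{(N^2+k-i)(N^2+k-i+1)\cdots(N^2+k)} = \frac{S}{\lambda} \sum_{i=0}^{k} \binom{N^2+k}{i} \frac{i!\,(N^2+k-i-1)!}{(N^2+k)!}.$$

Because

$$\binom{N^2+k}{i} = \frac{(N^2+k)!}{(N^2+k-i)!\; i!},$$

$$\text{MTTF} = \frac{S}{\lambda} \sum_{i=0}^{k} \frac{(N^2+k-i-1)!}{(N^2+k-i)!} = \frac{S}{\lambda} \sum_{i=0}^{k} \frac{1}{N^2+k-i}$$

or

$$\text{MTTF} = \frac{S}{\lambda} \left( \frac{1}{N^2} + \frac{1}{N^2+1} + \frac{1}{N^2+2} + \cdots + \frac{1}{N^2+k} \right). \qquad (6)$$

The MTTF model in (6) also can be written as a difference of two Harmonic series as

$$\text{MTTF} = \frac{S}{\lambda} \left( \sum_{i=1}^{N^2+k} \frac{1}{i} - \sum_{i=1}^{N^2-1} \frac{1}{i} \right) = \frac{S}{\lambda} \left( H_{N^2+k} - H_{N^2-1} \right) \qquad (7)$$

where $H_n = \sum_{i=1}^{n} \frac{1}{i} = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}$ is the Harmonic series. The MTTF model in (7) is called the Harmonic series model.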
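The simplification step can be verified with exact rational arithmetic. The sketch below assumes the identity $\sum_{j=0}^{i} (-1)^{j} \binom{i}{j} \frac{1}{x+j} = \frac{i!}{x(x+1)\cdots(x+i)}$ with $x = N^2+k-i$, and checks that the double sum collapses to the harmonic form of (6) with the factor $S/\lambda$ omitted. All function names are hypothetical.

```python
from fractions import Fraction
from math import comb, factorial


def alternating_sum(x, i):
    """Left-hand side: sum over j of (-1)^j * C(i, j) / (x + j)."""
    return sum(Fraction((-1) ** j * comb(i, j), x + j) for j in range(i + 1))


def closed_form(x, i):
    """Right-hand side: i! / (x (x+1) ... (x+i))."""
    denom = 1
    for m in range(x, x + i + 1):
        denom *= m
    return Fraction(factorial(i), denom)


def double_sum(n_logic, k):
    """Double summation of the derivation, without the factor S / lambda."""
    total = n_logic + k
    return sum(comb(total, i) * alternating_sum(total - i, i) for i in range(k + 1))


def harmonic_form(n_logic, k):
    """1/N^2 + 1/(N^2+1) + ... + 1/(N^2+k), as in (6) without S / lambda."""
    return sum(Fraction(1, n_logic + j) for j in range(k + 1))


print(double_sum(4, 3) == harmonic_form(4, 3))
```

Using `Fraction` avoids the cancellation errors that the alternating sum would suffer in floating point for larger $i$.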


3.2.  Comparison  of  the  models  of  MTTF  and  reliability 

How the spare PEs improve the MTTF is very different from the way they improve the reliability in large size FTPAs. According to Equation 4, the reliability of an FTPA with $N^2$ PEs and $k$ spares is

$$R = S \left( r^{N^2+k} + (N^2+k)\, r^{N^2+k-1} q + \frac{(N^2+k)(N^2+k-1)}{2!}\, r^{N^2+k-2} q^{2} + \cdots \right),$$

where $r = e^{-\lambda t}$ is the reliability of a single PE and $q = 1 - r$. If it is assumed that $r \gg q$, $N^2 \gg k$ and $S$ is 1, then the reliability of the array can be approximated as

$$R \approx r^{N^2+k} \left( 1 + \frac{N^2 q}{r} + \frac{1}{2!} \left( \frac{N^2 q}{r} \right)^{2} + \cdots \right).$$

Intuitively, the first item in the above equation is the reliability of the array without any replacement. The second item in the equation can be thought of as the reliability increment for the case when the first faulty PE can be replaced or recovered. The third item is the reliability increment if the second faulty PE can also be replaced, and so on. Let $\Delta_r(i)$ be the increment rate, which is the ratio of the reliability increment contributed by the replacement of the $i$th faulty PE to the reliability of the corresponding array without any replacement. Figure 1 shows that $\Delta_r(i)$ strongly depends on the failure rate $\lambda$ and $i$, and it decreases very quickly with the number of spares when $\lambda$ is small. This implies that a significant reliability improvement is unlikely to result from increasing the number of spare PEs, and the reliability cannot be improved constantly by successively adding spares for FTPAs with very small failure rate $\lambda$.

However, the MTTF can be improved almost constantly when $N^2 \gg k$ and the improvement rate contributed by the spare PEs is almost independent of $\lambda$ and $i$. Let $\Delta_m(i)$ denote the increment rate, which is the ratio of the MTTF increment contributed by the replacement of the $i$th faulty PE to the MTTF of the corresponding array without any replacement. Figure 1 shows that $\Delta_m(i)$ is approximately a linear function of the number $i$ regardless of the value of $\lambda$. For example, consider an FTPA where $N^2 = 100$, $\lambda = 10^{-4}$, $S \approx 1$ and $k = 1$ (which means only one faulty PE can be replaced); then the MTTF of the array with one spare PE is

$$\text{MTTF}_{k=1} = \frac{1}{\lambda} \left( \frac{1}{100} + \frac{1}{101} \right) \approx 199 \approx 2 \cdot \text{MTTF}_{k=0} \qquad (8)$$

where $\text{MTTF}_{k=0} = 1/(N^2 \lambda) = 100$ is the mean time to failure of the simplex array without any redundancy. Equation 8 shows that there is almost a 100 percent improvement in the MTTF for the FTPA with only one spare PE. However, for the same FTPA with one spare PE, there is only about 1 percent improvement in the reliability with unity operation time ($t = 1$).

Figure 1: The dotted lines indicate the variation of the increment of reliability $\Delta_r(i)$ with the number of spares $i$. The solid line indicates the variation of the increment of MTTF $\Delta_m(i)$ with $i$. The logic size of the FTPAs is $N^2 = 100$ and the operation time is $t = 1$. It shows that $\Delta_r(i)$ is strongly dependent on the failure rate $\lambda$ and $i$, but $\Delta_m(i)$ is almost independent of $\lambda$ and $i$.

In summary, the MTTF can be improved by adding spare PEs when $N^2 \gg k$ and the improvement rate is independent of the number of spares and the failure rate $\lambda$. In contrast, the reliability improvement per added spare PE depends on the
number of spare PEs and the failure rate $\lambda$.
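The contrast can be reproduced numerically for the example with $N^2 = 100$, $\lambda = 10^{-4}$, $S = 1$ and $t = 1$. The sketch assumes the harmonic form $\text{MTTF} = (S/\lambda) \sum_{j=0}^{k} 1/(N^2+j)$ and the binomial form of the reliability in Equation 4; the helper names are hypothetical.

```python
import math


def mttf(n_logic, k, lam, s=1.0):
    """Harmonic series model: (S / lambda) * sum of 1 / (N^2 + j), j = 0..k."""
    return s / lam * sum(1.0 / (n_logic + j) for j in range(k + 1))


def reliability(n_logic, k, lam, t=1.0, s=1.0):
    """Reliability with constant survivability S and fault-free non-redundant hardware."""
    total = n_logic + k
    r = math.exp(-lam * t)
    q = 1.0 - r
    return s * sum(
        math.comb(total, i) * r ** (total - i) * q ** i for i in range(k + 1)
    )


n_logic, lam = 100, 1e-4
mttf_gain = mttf(n_logic, 1, lam) / mttf(n_logic, 0, lam) - 1.0
rel_gain = reliability(n_logic, 1, lam) / reliability(n_logic, 0, lam) - 1.0

print("MTTF with one spare: %.1f" % mttf(n_logic, 1, lam))
print("MTTF gain: %.1f%%, reliability gain: %.2f%%" % (100 * mttf_gain, 100 * rel_gain))
```

With these inputs a single spare nearly doubles the MTTF while the reliability at $t = 1$ rises by roughly one percent.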

3.3.  Simplified  form  of  MTTF 

Because there is no closed form expression for the Harmonic series model of the MTTF,

$$\text{MTTF} = \frac{S}{\lambda} \left( \frac{1}{N^2} + \frac{1}{N^2+1} + \frac{1}{N^2+2} + \cdots + \frac{1}{N^2+k} \right) = \frac{S}{\lambda} \left( H_{N^2+k} - H_{N^2-1} \right),$$

further simplification of the MTTF model may be needed in order to find the analytic expression for the number of spares which is required for a given FTPA structure.

According to [8], $H_n$ can be written as another series, which may contain an infinite number of items, as

$$H_n = \ln n + \gamma + \frac{1}{2n} - \frac{1}{12 n^{2}} + \frac{1}{120 n^{4}} - \epsilon,$$

where $0 < \epsilon < 1/(252\, n^{6})$ and $\gamma = 0.5772156649\ldots$ is Euler's constant. Then the MTTF can be expressed as

$$\text{MTTF} = \frac{S}{\lambda} \left[ \left( \ln(N^2+k) + \gamma + \frac{1}{2(N^2+k)} - \frac{1}{12(N^2+k)^{2}} + \cdots \right) - \left( \ln(N^2-1) + \gamma + \frac{1}{2(N^2-1)} - \frac{1}{12(N^2-1)^{2}} + \cdots \right) \right]. \qquad (9)$$

The main advantage of this expression, compared to the Harmonic series model in (7), is that the series in (9) converges much more quickly than that in (7) for those FTPAs having a large number of PEs $N^2$. Assuming the number of PEs in an FTPA is large enough so that only the first three items in series (9) are significant enough to approximate the MTTF, and letting $Y = N^2$ (with $N^2 - 1 \approx N^2$ for large arrays), it follows that

$$\text{MTTF} \approx \frac{S}{\lambda} \left( \ln\frac{Y+k}{Y} - \frac{k}{2Y(Y+k)} \right). \qquad (10)$$

This closed form expression of MTTF is much simpler than the one in (7). The error of this approximation can be neglected for large size FTPAs. The comparison of these two expressions with different values of $Y$ and $k$ is given in Table 1. The estimation of the necessary number of spares (NNS) required for an FTPA to achieve a prescribed value of MTTF is discussed in the next section by using the model in (10).


Logic number   Number of spares   MTTF       MTTF*
200            50                 227.6443   222.6436
200            100                409.6329   404.6318
200            150                563.5458   558.5443
200            200                696.8987   691.8972
400            50                 120.1443   117.6441
400            100                225.3937   222.8936
400            150                320.6131   318.1128
400            200                407.5487   405.0484
600            50                 81.6453    79.9786
600            100                155.6984   154.0316
600            150                224.6436   222.9769
600            200                289.1405   287.4738

Table 1: The values of MTTF are obtained from the Harmonic series model in (6) and the values marked by MTTF* are obtained from the approximation model in (10). The failure rate is $\lambda = 0.001$ and survivability is $S = 1$.
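The entries of Table 1 can be recomputed with a few lines of Python. The sketch assumes the harmonic form $\text{MTTF} = (S/\lambda) \sum_{i=Y}^{Y+k} 1/i$ for the exact model and $\text{MTTF}^{*} = (S/\lambda) \left[ \ln\frac{Y+k}{Y} - \frac{k}{2Y(Y+k)} \right]$ for the approximation in (10); the function names are hypothetical.

```python
import math


def mttf_harmonic(y, k, lam=0.001, s=1.0):
    """Exact harmonic series model: (S / lambda) * sum of 1/i for i = Y..Y+k."""
    return s / lam * sum(1.0 / i for i in range(y, y + k + 1))


def mttf_approx(y, k, lam=0.001, s=1.0):
    """Closed-form approximation assumed for (10)."""
    return s / lam * (math.log((y + k) / y) - k / (2.0 * y * (y + k)))


for y in (200, 400, 600):
    for k in (50, 100, 150, 200):
        print("%4d %4d %9.4f %9.4f" % (y, k, mttf_harmonic(y, k), mttf_approx(y, k)))
```

The gap between the two columns is close to $S/(\lambda Y)$, which is the single term $1/Y$ that separates $H_{Y+k} - H_{Y-1}$ from $H_{Y+k} - H_{Y}$.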


4.  Necessary  Number  of  Spares 
Given a desired value of MTTF (the prescribed MTTF) and an FTPA structure (the logic size, fault-tolerant scheme, etc.), the necessary number of spares (NNS) is defined as the minimum number of spares needed to achieve the prescribed MTTF. For a prescribed value of MTTF, if the number of spares in an FTPA is much greater than the necessary number of spares, the spares in excess will render the array unnecessarily expensive. In this section, estimates of NNS of FTPA structures with constant survivability are proposed.

Let $M_p$ denote the prescribed value of MTTF of an FTPA. According to the model in (10), the number of spares should satisfy the equation

$$M_p = \frac{S}{\lambda} \left( \ln\frac{Y+k}{Y} - \frac{k}{2Y(Y+k)} \right). \qquad (11)$$

Let $M = (M_p \lambda)/S$ and $\delta = k/(2Y(Y+k))$; then Equation 11 can be rewritten as $(Y+k)/Y = e^{M} e^{\delta}$. Hence the necessary number of spares $k$ for which the array can achieve the prescribed value $M_p$ is

$$k = Y \left( e^{M} e^{\delta} - 1 \right). \qquad (12)$$

This is not yet a final solution of the NNS because $\delta$ itself is a function of $k$. In the rest of this section, further simplification and estimation of the solution of Equation 12 are discussed.

To estimate the solution, Equation 12 is studied in three
different  cases.  In  the  first  case,  the  value  of  M  is  small  and  in 
the  second  case,  the  value  of  M  is  large.  The  third  case  consid¬ 
ers  FTPA  designs  which  most  frequently  happen  in  practice, 
and  therefore  it  deserves  more  attention. 

In the first case, it is assumed that, for a given FTPA structure and the prescribed MTTF, the value $M = (M_p \lambda)/S$ is very small such that $M < 1/Y \ll 1$. According to (6), the MTTF of an array without redundancy ($k = 0$) is $\text{MTTF} = S/(Y\lambda)$. This implies that a prescribed MTTF of an array is at least $M_p \ge S/(Y\lambda)$ and it must be equal to or greater than the MTTF of the array without any redundancy. In other words, the value $M = (M_p \lambda)/S$ of any array is at least $M = 1/Y$. Therefore, for a given FTPA structure, when the prescribed MTTF $M_p$ is such that $M = (M_p \lambda)/S < 1/Y$, no redundancy is needed for the array and

$$k = 0. \qquad (13)$$

The second case includes all FTPAs where the prescribed
MTTF is large such that $M > 1/N^2$. Since
$\delta = k/(2N^2(N^2+k)) < 1/(2N^2) \ll 1$ usually is very small for
large size arrays, it has $e^{-\delta} \approx 1 - \delta$ where the error is $O(\delta^2)$,
which is $O(1/N^4)$. Then Equation 12 can be approximated as
$k = N^2(e^{M}(1 - \delta) - 1)$. Because $\delta \ll 1$ and $1 - \delta \approx 1$, it yields

$$k \approx N^2(e^{M} - 1).$$

Thus, for a given FTPA with $M = (M_p\lambda)/S > 1/N^2$, the necessary
number of spares NNS for the array to achieve the
prescribed value of MTTF can be estimated by an exponential
function as

$$k = \left\lceil N^2\left(e^{(M_p\lambda)/S} - 1\right)\right\rceil. \qquad (14)$$

The estimation of the NNS by Equation 14 for FTPAs
with different values of $M$ and $N^2$ is shown in Figure 2 by dotted
lines. As a comparison, the accurate values of the NNS are
indicated by x in Figure 2. These accurate values are obtained
by a computer program in which the calculation of Equation 6
is repeated for different values of $k = 1, 2, \ldots$ until the result of
Equation 8 reaches the desired MTTF for a given FTPA. It is
clear that Equation 14 provides a good estimation of the NNS
for FTPAs with large values of $M$.

The first case does not happen very often and can be
ignored in large size fault-tolerant processor array designs. This
is because if $M = (M_p\lambda)/S \le 1/N^2$, the prescribed MTTF
$M_p \le S/(N^2\lambda)$ is very low when the logic size of the array $N^2$ is
large and, in practice, the prescribed MTTF is expected to be
much larger than this value.

The second case, where $M$ is larger than or comparable to
one, also does not happen very often in practice. This is
because for an FTPA structure with given survivability $S$ and
failure rate of a single PE $\lambda$, a large value of $M = (M_p\lambda)/S$ may
require a huge number of spares, which is unreasonably expensive
to implement. For example, consider an FTPA with logic
size $N^2 = 400$ and $S = 1$: according to Equation 14, almost 700
spare PEs are required for the array if $M = 1$ is expected. So it
may be too expensive to implement an FTPA to achieve a
prescribed MTTF which results in a large value of $M$. Actually,
a prescribed MTTF $M_p$ such that $M = (M_p\lambda)/S$ is very
high may not be achievable for large size FTPAs if the failure
of non-redundant hardware is considered, as indicated later in
Section 6. Experience with numerical simulations shows that,
for most FTPA designs, the prescribed MTTF $M_p$ is such that
$M = (M_p\lambda)/S$ lies in a special range where $M$ is neither
comparable to one nor smaller than $1/N^2$. Because it applies to
most FTPA designs, this special range of $M$, which is a subcase
of case two, is discussed next as the third case, and a simpler
linear equation, instead of (14), can be used to estimate the
NNS for this special case.

Figure 2: The dotted lines are obtained from the exponential
estimation model in (14) with different values of $N^2$ and $M$.
The points marked by x are the accurate values of NNS
required for the FTPAs obtained by a simulation program
according to (8). (Horizontal axis: $M = M_p\lambda/S$; vertical axis: NNS.)

The third case applies to most FTPAs. Typically in this
case, $0.1 > M > 1/N^2$. Because $M$ is relatively small, the
approximation formula $e^{x} \approx 1 + x$ for small $x$ can be used for
$e^{M}$ and Equation 14 can be approximated as

$$k \approx \lceil N^2((1 + M) - 1)\rceil = \lceil N^2 M\rceil. \qquad (15)$$

This provides a very simple linear estimation formula of the
NNS for most of the FTPAs:

$$k \approx \lceil ((M_p\lambda)/S)\,N^2\rceil. \qquad (16)$$

It shows that, for most FTPAs with given $S$ and $\lambda$, the necessary
number of spares to achieve the prescribed MTTF is
almost proportional to the logic size $N^2$ and the prescribed
MTTF $M_p$.

Figure 3 compares the estimation models in (16) and (14)
to the accurate values of the NNS obtained by Equation 8. The
dashed lines in Figure 3 show the estimation of the NNS
according to Equation 16, and the dotted lines show the estimation
of the NNS obtained by Equation 14. The accurate values
of the NNS obtained from the computer program according to
Equation 8 are marked by x. As Figure 3 illustrates, the linear
Equation 16 provides a quite good estimation of the NNS
when $M$ is small; indeed, (16) is a linear approximation of (14)
when $M$ is small. When $M$ is large or $M > 0.1$, estimation model
(14) provides better results than the one in (16). The estimation
error of model (16) increases with $M$.

Figure 3: The dashed lines are obtained from linear estimation
equation (16) and the dotted lines are obtained from
exponential estimation equation (14). The points marked by x
are accurate values of NNS of the FTPAs obtained by a
simulation program according to (6). It shows that the error
of linear estimation equation (16) increases with the value of $M$
of the FTPAs. (Horizontal axis: $M = M_p\lambda/S$; vertical axis: NNS.)

In summary, for a given prescribed MTTF $M_p$ of an
FTPA with nearly perfect non-redundant hardware ($R_n \approx 1$)
and constant survivability $S$, the NNS can be estimated by
three different formulas according to the value of
$M = (M_p\lambda)/S$.

(1) If $M = (M_p\lambda)/S \le 1/N^2$, no spare PEs are needed for the
array and $k = 0$.

(2) If $M > 1/N^2$, the NNS can be estimated by an exponential
function as $k = \lceil N^2(e^{M} - 1)\rceil$.

(3) If $0.1 > M > 1/N^2$, the NNS can be estimated especially
by a linear function $k = \lceil M N^2\rceil$.
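
The three cases summarized above can be sketched in a few lines of Python. This is only an illustration, not the authors' program: the function names are hypothetical, the brackets in (14) and (16) are read as a ceiling, and the brute-force check assumes the constant-survivability model MTTF $= (S/\lambda)\sum_{i=N^2}^{N^2+k} 1/i$, which reduces to $S/(N^2\lambda)$ for $k = 0$ as stated in the text.

```python
import math


def nns_estimate(n2: int, m: float) -> int:
    """Estimate the necessary number of spares (NNS).

    n2 is the logic size N^2 and m is M = (M_p * lambda) / S.
    The brackets in (14) and (16) are assumed to mean 'round up'.
    """
    if m <= 1.0 / n2:
        return 0                                  # case 1, Equation 13
    if m < 0.1:
        return math.ceil(m * n2)                  # case 3, Equation 16
    return math.ceil(n2 * (math.exp(m) - 1.0))    # case 2, Equation 14


def nns_brute_force(n2: int, m: float) -> int:
    """Smallest k whose normalized MTTF reaches m.

    Assumed model (not quoted from the paper):
    MTTF * lambda / S = sum_{i = N^2}^{N^2 + k} 1 / i.
    """
    k = 0
    total = 1.0 / n2          # normalized MTTF with no spares
    while total < m:
        k += 1
        total += 1.0 / (n2 + k)
    return k
```

For $N^2 = 400$ and $M = 1$, `nns_estimate` returns 688, in line with the "almost 700" spares quoted earlier; under the assumed model the brute-force count lands within a few spares of that value.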

All estimations of the NNS above are based on the
assumption that the non-redundant hardware of the FTPA is simple,
takes a small area and is almost perfect such that $R_n \approx 1$.
The general case where $R_n < 1$ is discussed next.

5. General Model of MTTF
Because the amount of non-redundant hardware of
FTPAs is usually proportional to the size of the array and
increases with the number of PEs in the array, the reliability
of non-redundant hardware may decrease drastically and
become too low to ignore for very large size FTPAs. As indicated
by (1), the overall reliability is the product of the reliabilities
of the non-redundant and redundant hardware. Therefore,
for very large size processor arrays, a valid estimation of
the desired amount of redundancy must take the influence
of the non-redundant hardware into consideration. In these
cases, instead of $R_n \approx 1$, the expression $R_n = e^{-A_n\lambda t}$ should be
used to establish more accurate reliability models for large size
FTPAs, where $A_n$ is, as defined in Section 2, the total area
taken by non-redundant hardware. Therefore, for the general
case, instead of Equation 4, Equation 3 should be used to study
and derive MTTF models, i.e.,

$$R(t) = S\,e^{-A_n\lambda t}\sum_{i=0}^{k}\binom{N^2+k}{i}\,e^{-(N^2+k-i)\lambda t}\left(1 - e^{-\lambda t}\right)^{i}.$$

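According to Equation 5, the general MTTF model can be
obtained as

$$\begin{aligned}
\mathrm{MTTF} &= \int_0^{\infty} R(t)\,dt \\
&= S\sum_{i=0}^{k}\binom{N^2+k}{i}\int_0^{\infty} e^{-(N^2+k+A_n-i)\lambda t}\left(1-e^{-\lambda t}\right)^{i}dt \\
&= S\sum_{i=0}^{k}\binom{N^2+k}{i}\int_0^{\infty} e^{-(N^2+k+A_n-i)\lambda t}\sum_{j=0}^{i}\binom{i}{j}(-1)^{j}e^{-j\lambda t}\,dt \\
&= S\sum_{i=0}^{k}\binom{N^2+k}{i}\sum_{j=0}^{i}\binom{i}{j}(-1)^{j}\int_0^{\infty} e^{-(N^2+k+A_n-i+j)\lambda t}\,dt \\
&= \frac{S}{\lambda}\sum_{i=0}^{k}\binom{N^2+k}{i}\sum_{j=0}^{i}\binom{i}{j}\frac{(-1)^{j}}{N^2+k+A_n-i+j}. \qquad (17)
\end{aligned}$$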

Because the summation of the combination in (17) can be
simplified as

$$\sum_{j=0}^{i}\binom{i}{j}\frac{(-1)^{j}}{C+j} = \frac{i!}{C(C+1)(C+2)\cdots(C+i)} \quad [8],$$

where $C$ is not necessarily integral, the MTTF of FTPAs can
be expressed as

$$\mathrm{MTTF} = \frac{S}{\lambda}\sum_{i=0}^{k}\binom{N^2+k}{i}\frac{i!}{(N^2+k+A_n-i)(N^2+k+A_n-i+1)\cdots(N^2+k+A_n)}. \qquad (18)$$


The total area of non-redundant hardware $A_n$ should not be
assumed to be very small for large size FTPAs in this section;
otherwise the model in Equation 10 can be used to estimate the
MTTF as discussed in Section 4. To simplify (18), the total
area of non-redundant hardware $A_n = (N^2+k)a_n$ can be approximated
by the nearest integer $B = \lfloor A_n \rceil \ge 1$. This is because
usually $N^2$ is large and the relative error between $N^2 + A_n$ and
$N^2 + B$ is very small. Then the equation above can be simplified
as

$$\mathrm{MTTF} \approx \frac{S}{\lambda}\sum_{i=0}^{k}\binom{N^2+k}{i}\frac{i!\,(N^2+k+B-i-1)!}{(N^2+k+B)!}. \qquad (19)$$


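Because

$$\binom{N^2+k}{i} = \frac{(N^2+k)!}{(N^2+k-i)!\,i!},$$

$$\begin{aligned}
\mathrm{MTTF} &\approx \frac{S}{\lambda}\sum_{i=0}^{k}\frac{(N^2+k)!\,i!\,(N^2+k+B-i-1)!}{(N^2+k-i)!\,i!\,(N^2+k+B)!} \\
&= \frac{S}{\lambda}\,\frac{(N^2+k)!}{(N^2+k+B)!}\sum_{i=0}^{k}\frac{(N^2+k+B-1-i)!}{(N^2+k-i)!} \\
&= \frac{S}{\lambda}\,\frac{(N^2+k)!}{(N^2+k+B)!}\sum_{i=0}^{k}(N^2+k+1-i)(N^2+k+2-i)\cdots(N^2+k+B-1-i).
\end{aligned}$$

Because $B \ge 1$, then

$$\mathrm{MTTF} \approx \frac{S}{\lambda}\,\frac{(N^2+k)!}{(N^2+k+B)!}\sum_{i=N^2+1}^{N^2+k+1} i(i+1)(i+2)\cdots(i+B-2) \qquad (20)$$

$$\mathrm{MTTF} \approx \frac{S}{\lambda}\,\frac{(N^2+k)!}{(N^2+k+B)!}\left(\sum_{i=1}^{N^2+k+1} i(i+1)\cdots(i+B-2) - \sum_{i=1}^{N^2} i(i+1)\cdots(i+B-2)\right). \qquad (21)$$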

It is easy to prove by induction that

$$\sum_{i=1}^{n} i(i+1)\cdots(i+b) = \frac{1}{b+2}\,n(n+1)\cdots(n+b+1) \quad [8],$$

so Equation 21 can be simplified as

$$\mathrm{MTTF} \approx \frac{S}{\lambda B}\left(1 - \prod_{i=1}^{B}\frac{N^2-1+i}{N^2+k+i}\right). \qquad (22)$$


Because the area of non-redundant hardware $B$ used as a
denominator in (22) is not required to be an integer, the accurate
value $A_n$ can be substituted for $B$ there. Then, the general
model of MTTF can be written as

$$\mathrm{MTTF} \approx \frac{S}{\lambda A_n}\left(1 - \prod_{i=1}^{B}\frac{N^2-1+i}{N^2+k+i}\right). \qquad (23)$$

The error caused by using the integer $B$ to replace the total
area of non-redundant hardware is negligible because the
difference between $B$ and $A_n$ is less than or equal to 0.5.
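
As a rough numerical check of this section, the sketch below evaluates the sum in (18) term by term and compares it with the closed form (23). The function names and the default values of $\lambda$ and $S$ are illustrative assumptions, and Python's `round` is used as a stand-in for "nearest integer".

```python
def mttf_general(n2, k, a_n, lam=1.0, s=1.0):
    """Equation 18 evaluated term by term; A_n need not be an integer."""
    n = n2 + k
    area = n * a_n            # A_n = (N^2 + k) * a_n
    total = 0.0
    ratio = 1.0               # C(n, i) * i! / prod_{m=0}^{i-1} (n + A_n - m)
    for i in range(k + 1):
        total += ratio / (n + area - i)
        ratio *= (n - i) / (n + area - i)
    return s / lam * total


def mttf_closed_form(n2, k, a_n, lam=1.0, s=1.0):
    """Equation 23 with B taken as the nearest integer to A_n (at least 1)."""
    area = (n2 + k) * a_n
    b = max(1, round(area))
    prod = 1.0
    for i in range(1, b + 1):
        prod *= (n2 - 1 + i) / (n2 + k + i)
    return s / (lam * area) * (1.0 - prod)
```

When $A_n$ happens to be an integer the two functions agree up to rounding error, which mirrors the derivation from (18) to (22); for non-integer $A_n$ they differ slightly, as the text anticipates.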


6. Optimal Number of Spares


As  discussed  in  Section  1,  because  non-redundant 
hardware  can  also  fail  and  the  non-redundant  hardware 
increases  along  with  the  number  of  spares,  it  is  not  always  true 
that  the  larger  the  number  of  spares,  the  higher  the  MTTF  of 
the  FTPA.  As  indicated  by  (2),  when  the  number  of  spares 
increases,  the  reliability  of  the  redundant  hardware  increases 
and  the  reliability  of  the  non-redundant  hardware  decreases.  At 
the  beginning,  the  overall  reliability  increases  along  with  the 
number  of  spares  and  it  decreases  when  the  number  of  spares  is 
beyond  some  value.  Therefore,  there  is  a  particular  number  of 
spares  for  which  the  MTTF  is  maximized.  This  section 
discusses  how  to  find  the  ONS  by  using  the  models  proposed  in 
Section  4. 


Let $X(k) = \dfrac{S}{\lambda (N^2+k)a_n}$ denote the first part of
Equation 23 and let $Y(k) = 1 - \prod_{i=1}^{B}\dfrac{N^2-1+i}{N^2+k+i}$ be the second
part of Equation 23. Then MTTF $= X(k)\cdot Y(k)$. Because $X(k)$
decreases with $k$ and $Y(k)$ increases with $k$, there is an optimal
value of $k$ which can maximize MTTF$(k) = X(k)\cdot Y(k)$ for a
given FTPA structure. Figure 4 provides an intuitive idea of
how the MTTF of FTPAs varies with the number of spares.
The dotted lines describe the MTTF of the arrays with
$N^2 = 400$ and dashed lines correspond to the MTTF of the
arrays with $N^2 = 100$. The survivability of the arrays is $S = 1$.
It is clearly shown that the MTTF increases with the number
of spares first, and then after some point, it starts to decrease
monotonically. For example, consider an array with $a_n = 10^{-1}$
and $N^2 = 400$; the MTTF is optimized only when the array
takes $k = 38$ spare PEs.


It  is  clear  that  the  optimal  number  of  spares  ONS  of  the 
FTPA  is  the  minimum  number  of  k  which  satisfies  the  follow¬ 
ing  inequality 


X(k)Y(k)>X(k+l)Y(k+l)  .  (24) 


This is because before and after the optimal number of spares,
the MTTF increases and decreases monotonically, respectively.
Unfortunately,  it  is  very  difficult  to  find  the  accurate  solution 
of  the  inequality  in  (24)  if  a  numerical  method  is  not  used. 
Some  further  simplification  of  inequality  24  is  discussed  in  the 
rest  of  the  section  based  on  which  a  simple  estimation  of  the 
ONS  is  obtained. 


Figure 4: The variations of MTTF of the FTPAs with different
values of the parameters are shown by dotted and solid lines.
The points marked by x are values of MTTF of the arrays with
the number of spares estimated by
$k = \lceil N^2((N^2 a_n + 1)^{1/(N^2 a_n)} - 1) - 2\rceil$. (Vertical axis: MTTF.)

As indicated in Section 2, the reliability of non-redundant
hardware is $R_n = e^{-A_n\lambda t} = e^{-(N^2+k)a_n\lambda t}$ and the reliability of a
simplex array without any redundancy ($k = 0$) is $R = e^{-N^2\lambda t}$.
Because the overall reliability of an FTPA is the product of the
reliabilities of redundant hardware and non-redundant
hardware, i.e., $R = R_r \cdot R_n$, one necessary condition for an
FTPA to achieve a better reliability than that of the
corresponding simplex array with the same logic number of
PEs is that the total area of non-redundant hardware in the
FTPA should be much smaller than the total area of the redundant
hardware of the array, i.e., $A_n = (N^2+k)a_n \ll N^2$. If the
area of the non-redundant hardware of an FTPA is equal to or
comparable with the area of the redundant hardware in the
array, the reliability of that FTPA cannot be better, and it could
be even worse, than the reliability of the corresponding simplex
array no matter what fault-tolerant scheme is used. From
these discussions, it can be concluded that one constraint in
FTPA design is $a_n \ll 1$. So, it is reasonable to say that in
most cases, the integers $B(k) = \lfloor (N^2+k)a_n \rceil$ and
$B(k+1) = \lfloor (N^2+k+1)a_n \rceil$ have the same value. Then the
inequality in (24) can be approximated as


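$$\frac{S}{\lambda(N^2+k)a_n}\left(1 - \prod_{i=1}^{B}\frac{N^2-1+i}{N^2+k+i}\right) > \frac{S}{\lambda(N^2+k+1)a_n}\left(1 - \prod_{i=1}^{B}\frac{N^2-1+i}{N^2+k+i+1}\right)$$

and we have

$$\prod_{i=1}^{B}\frac{N^2+k+1+i}{N^2-1+i} > B+1$$

or

$$\prod_{i=1}^{B}(N^2+k+1+i) > (B+1)\prod_{i=1}^{B}(N^2-1+i). \qquad (25)$$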

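Notice that

$$\prod_{j=1}^{l}(C+j) = (C+1)(C+2)\cdots(C+l-1)(C+l)$$

can be written as

$$\prod_{j=1}^{l}(C+j) = \big((C+1)(C+l)\big)\big((C+2)(C+l-1)\big)\cdots$$

or

$$\prod_{j=1}^{l}(C+j) = \left(\Big(C+\tfrac{l+1}{2}\Big)^{2} - \Big(\tfrac{l-1}{2}\Big)^{2}\right)\left(\Big(C+\tfrac{l+1}{2}\Big)^{2} - \Big(\tfrac{l-3}{2}\Big)^{2}\right)\cdots$$

If $C \gg (l-1)/2$, it can be approximated by

$$\prod_{j=1}^{l}(C+j) \approx \Big(C+\tfrac{l+1}{2}\Big)^{2}\Big(C+\tfrac{l+1}{2}\Big)^{2}\Big(C+\tfrac{l+1}{2}\Big)^{2}\cdots = \Big(C+\tfrac{l+1}{2}\Big)^{l}. \qquad (25a)$$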

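As mentioned before, the area of the non-redundant hardware
is much smaller than the total area for PEs, i.e.,
$N^2 \gg B > (B-1)/2$. So, by (25a), inequality (25) can be approximated
as

$$\left(N^2+k+1+\tfrac{B+1}{2}\right)^{B} > (B+1)\left(N^2-1+\tfrac{B+1}{2}\right)^{B}.$$

Then the value of $k$ is

$$k = \left\lceil \left(N^2+\tfrac{B-1}{2}\right)\left((B+1)^{1/B} - 1\right) - 2 \right\rceil.$$

Because $N^2 \gg B$ and $N^2+((B-1)/2) \approx N^2$, $k$ can be approximated
as

$$k = \left\lceil N^2\left((B+1)^{1/B} - 1\right) - 2 \right\rceil. \qquad (26)$$

If it is assumed that $N^2 \gg k$ so that $B \approx N^2 a_n$, then a simpler
estimation of the ONS is

$$k = \left\lceil N^2\left((N^2 a_n+1)^{1/(N^2 a_n)} - 1\right) - 2 \right\rceil. \qquad (27)$$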

This formula shows that the value of the ONS of a given FTPA
structure is independent of the failure rate of a single PE, $\lambda$. It
also shows that, for the FTPAs which have constant survivability
$S$, the value of the ONS is independent of the value of $S$.

The acceptable accuracy of the estimation of the ONS by
(27) is demonstrated in Figures 4 and 5. Figure 4 shows the
FTPAs with a relatively small optimal number of spares, and
Figure 5 is for the FTPAs with a relatively large optimal
number of spares. The dotted and the dashed lines show exactly
how the MTTF of the FTPAs varies with the number of
spares. These curves are obtained according to Equation 18 by
computer programs. The points marked by x are the values of
the MTTF corresponding to the estimated optimal number of
spares obtained according to (27). Figures 4 and 5 show that
the estimation of the ONS by (27) is quite close to the exact
value of the optimal number of spares required for different
FTPAs and the estimation by (27) is acceptable.
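
The comparison described above can be reproduced approximately with a short script. The sketch below is illustrative only: the function names are hypothetical, the brackets in (27) are read as a ceiling, and the exhaustive search evaluates the sum in (18) with $\lambda = S = 1$ (by (27), neither should affect where the maximum lies).

```python
import math


def ons_estimate(n2: int, a_n: float) -> int:
    """Equation 27, reading the outer brackets as a ceiling (assumption)."""
    b = n2 * a_n                      # B approximated by N^2 * a_n
    return math.ceil(n2 * ((b + 1.0) ** (1.0 / b) - 1.0) - 2.0)


def mttf(n2: int, k: int, a_n: float, lam: float = 1.0, s: float = 1.0) -> float:
    """General MTTF model (Equation 18), summed term by term."""
    n = n2 + k
    area = n * a_n                    # A_n = (N^2 + k) * a_n
    total, ratio = 0.0, 1.0
    for i in range(k + 1):
        total += ratio / (n + area - i)
        ratio *= (n - i) / (n + area - i)
    return s / lam * total


def ons_by_search(n2: int, a_n: float, k_max: int) -> int:
    """Number of spares in 0..k_max that gives the largest MTTF."""
    return max(range(k_max + 1), key=lambda k: mttf(n2, k, a_n))
```

For $N^2 = 400$ and $a_n = 0.1$, `ons_estimate` gives 37, close to the $k = 38$ quoted for this configuration in the text; `ons_by_search(400, 0.1, 200)` can be used to locate the maximum of the modeled MTTF directly.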

One assumption which makes it possible to estimate the
NNS and ONS by Equations 14, 16 and 27 is that the survivability
of the fault-tolerant scheme can be approximated by a
constant $S$ so that (4) provides a good reliability model of the
redundant hardware $R_r$. Some discussions of the cases where
the survivability $s_i$ is not close to a constant can be found in
[10]. The estimates obtained by Equations 14, 16 and 27 can
be readily used by FTPA designers instead of the time-consuming
trial-and-error computations of MTTF for many
different possible numbers of spares. They may not be as accurate
as estimates computed through detailed modeling and
sophisticated software tools such as in [6], [7] and others.
These tools can capture failure dependencies, the effect of faults
of links and control mechanisms and other details, but require
time-consuming model construction and computer runs, particularly
if different reconfiguration schemes and numbers of
spares must be considered. Ideally, detailed modeling should be
used when a few candidate designs have been identified which
are very likely to meet MTTF requirements. The estimates
provided in this paper can be used to identify these candidates.

Figure 5: The variations of MTTF of the FTPAs with different
values of the parameters are shown by dotted and dashed lines.
The points marked by x are values of MTTF of the arrays with
the number of spares estimated by
$k = \lceil N^2((N^2 a_n + 1)^{1/(N^2 a_n)} - 1) - 2\rceil$. (Horizontal axis: $k$; vertical axis: MTTF.)

7. Conclusions

An important problem in FTPA design is how to determine
the amount of redundancy needed to meet MTTF requirements
or to maximize MTTF. This is also important when a
designer needs to comparatively evaluate different FTPA
reconfiguration schemes. It would be interesting to know the
smallest amount of redundancy (the least expensive design)
needed to achieve a desired MTTF, or the goal may be to find
the best MTTF achievable by any scheme and to know how
many spares are required for this best MTTF. Because
different technologies and reconfiguration schemes can be
chosen to design FTPAs, several cases are considered in this
paper. The first case is for the FTPA designs where non-redundant
hardware can be assumed to be perfect and never
fail and any spare can replace any faulty component regardless
of the fault distribution. In the other case, the failure of non-redundant
hardware is taken into consideration. This paper
provides different methods that can be effectively used to estimate
the necessary number of spares needed to attain a
prescribed MTTF and the optimal number of spares which
maximizes the value of MTTF of a given reconfiguration
scheme. Finally, the results of this paper can be readily
extended to modular systems other than processor arrays.

References

[1] A. L. Rosenberg, "The Diogenes Approach to Testable
Fault Tolerant Arrays of Processors," IEEE Trans. on
Computers, Oct. 1983, pp. 902-910.

[2] M. Sami and R. Steffanelli, "Fault-Tolerance and Functional
Reconfiguration in VLSI Arrays," Int'l Conf. on Circuits
and Systems, 1986.

[3] A. Papoulis, "Probability, Random Variables, and Stochastic
Processes," McGraw Hill, New York, N.Y., 1984.

[4] M. Chean and J.A.B. Fortes, "The Full Use of Suitable
Spares (FUSS) Approach to Hardware Reconfiguration for
Fault-Tolerant Processor Arrays," IEEE Trans. on Computers,
April 1990, pp. 564-571.

[5] A.P. Reeves, "Fault Tolerance in Highly Parallel Mesh
Connected Processors," in "Computing Structures for
Image Processing," Academic Press, 1983, Chapter 6.

[6] N. Lopez-Benitez and J.A.B. Fortes, "Detailed Modeling
of Fault-Tolerant Processor Arrays," 19th Int'l. Symp.
FTCS, June 1989, pp. 545-552.

[7] A.M. Johnson and M. Malek, "Survey of Software Tools
for Evaluation Reliability, Availability, and Serviceability,"
ACM Computing Surveys, Vol. 20, No. 4, Dec. 1988,
pp. 227-269.

[8] D.E. Knuth, "The Art of Computer Programming," Vol.
1 and Vol. 2, 1973.

[9] M. Chean and J.A.B. Fortes, "A Taxonomy for
Reconfiguration Techniques for Fault-Tolerant Processor
Arrays," Computer, Vol. 23, No. 1, Jan. 1990, pp. 55-69.

[10] Y.X. Wang, "Approximate Estimation of Reliability,
Mean-Time-to-Failure and Optimal Redundancy of
Fault-Tolerant Processor Arrays," Ph.D. thesis, School of
Electrical Engineering, Purdue University, W. Lafayette,
IN 47907, May 1990.


REFERENCE  NO.  15 


Shang, W., O'Keefe, M. T., and Fortes, J. A. B., "On Loop Transformations for Generalized
Cycle Shrinking," 1991 International Conference on Parallel Processing,
August, St. Charles, Illinois, Vol. II, pp. 132-141.

Note  -  This  paper  shows  how  certain  compiler  optimizations  (generically  denoted  as 
cycle  shrinking  transformations)  can  be  unified  and  generalized  by  linear  schedules 
such  as  those  used  to  schedule  systolic  algorithms. 


1991  International  Conference  on  Parallel  Processing 


On  Loop  Transformations  for  Generalized  Cycle  Shrinking 


Weijia Shang
Center for Advanced Computer Studies
University of Southwestern Louisiana
Lafayette, LA 70504
sw@cacs.usl.edu

Matthew T. O'Keefe
Department of Electrical Engineering
University of Minnesota
Minneapolis, MN 55455
okeefe@ee.umn.edu

José A. B. Fortes
School of Electrical Engineering
Purdue University
West Lafayette, IN 47907
fortes@ee.ecn.purdue.edu


Abstract: This paper describes several loop transformation
techniques for extracting parallelism from nested loop
structures. One technique is called selective cycle shrinking and
the other is called true dependence cycle shrinking. It is shown
how selective shrinking is related to linear scheduling of
nested loops and how true dependence shrinking is related to
conflict-free mappings of higher dimensional algorithms into
lower dimensional algorithms. Methods are proposed in this
paper to find the selective and true dependence shrinkings
with minimum total execution time by applying the
techniques of finding optimal linear schedules and optimal
and conflict-free mappings proposed in [13] and [15],
respectively.

1.  Introduction 

This  paper  describes  several  loop  transformation 
techniques  for  extracting  parallelism  from  nested  loop 
structures.  This  task  is  often  performed  by  optimizing  and 
parallelizing  compilers  that  have  as  their  goal  the 
transformation  and  mapping  of  a  serial  program  into  a 
parallel  form  that  can  be  executed  on  a  particular  architecture 
[17]. Nested loop structures offer the most fruitful sources of
parallelism  in  serial  programs,  and  it  is  therefore  of 
paramount  importance  that  the  analysis  necessary  for  such 
parallelization  be  both  precise  and  efficient. 

Algorithms  under  consideration  in  this  paper  are  nested 
loops  with  regular  data  dependence  structures.  Such 
algorithms  can  be  modeled  by  a  uniform  dependence  algorithm 
$(J, D)$ [13]. Set $J$ is the index set or iteration space. Each
element in $J$ is an $n$-tuple integral column vector (called
iteration  vector  or  index  vector).  Matrix  D  is  the  dependence 
matrix  where  each  column  is  a  dependence  vector.  If 
computation  in  one  iteration  depends  on  the  computation  in 
another  iteration,  this  dependence  is  represented  by  the 
vector  difference  of  the  iteration  vectors  corresponding  to 
these  two  iterations.  All  dependences  are  assumed  to  be 
uniform  which  means  the  dependence  relation  exists  between 
two  iterations  as  long  as  the  vector  difference  between  these 
two  iterations  is  equal  to  the  vector  representing  that 
dependence  relation.  This  algorithm  model  can  be  used  for 
algorithms  with  nested  loops  and  cyclic  dependence 
structures  (dependence  cycles)  described  in  [11]  and  is  easily 
related  to  models  used  in  [5],  [10].  Uniform  dependence 
algorithms  can  be  used  to  model  a  class  of  practical 
algorithms  in  signal  and  image  processing  [4]. 

This research was supported in part by the National
Science Foundation under Grant DC1-3414745 and in
part by the Innovative Science and Technology Office of
the Strategic Defense Initiative Organization and was administered
through the Office of Naval Research under
contract No. 00014-88-k-0723.

In [11], three loop transformation techniques (simple
cycle shrinking, selective cycle shrinking and true dependence cycle
shrinking) were introduced to transform sequential loops
into parallel loops. The concept of cycle shrinking is best
illustrated by an algorithm with only one loop. In this case,
each dependence vector has only one entry. Cycle shrinking is
useful when the minimum of the dependences, $\delta$, is greater
than unity; it transforms a sequential DO loop into two
perfectly-nested loops: a sequential outer loop and a parallel
inner loop. For example, let a single loop have $n$ iterations,
i.e., index set $J = \{1, 2, \ldots, n\}$; then the iterations in each of the sets
$\{1, 2, \ldots, \delta\}$, $\{\delta+1, \ldots, 2\delta\}$, ... can be executed in parallel without
violating data dependence relations. If the algorithm has a
cyclic dependence structure, then the size of the dependence
cycle is shrunk by a reduction factor $\delta$, which is the number of
iterations that can be executed in parallel. More examples for
the intuitive concepts behind cycle shrinking can be found in
[11].
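
The single-loop case described above can be illustrated with a small Python sketch. The loop body (`a[i] = a[i - delta] + i`) and the function names are made up for illustration; the point is only that blocks of $\delta$ consecutive iterations read values produced by earlier blocks, so the iterations inside one block are independent of each other.

```python
def run_sequential(n: int, delta: int) -> list:
    """Reference order: iteration i depends on iteration i - delta."""
    a = [0] * (n + 1)                 # a[0] unused; iterations are 1..n
    for i in range(1, n + 1):
        a[i] = (a[i - delta] if i > delta else 0) + i
    return a


def run_shrunk(n: int, delta: int) -> list:
    """Sequential outer loop over blocks; each block could run in parallel."""
    a = [0] * (n + 1)
    for start in range(1, n + 1, delta):          # sequential outer loop
        block = range(start, min(start + delta, n + 1))
        # Every read in this block refers to an earlier block, so the
        # iterations are independent: compute all values, then store them.
        new_values = {i: (a[i - delta] if i > delta else 0) + i for i in block}
        for i, value in new_values.items():
            a[i] = value
    return a
```

Both functions produce the same array, which is the property the transformation has to preserve; the inner block in `run_shrunk` is written as a dictionary comprehension only to make the absence of intra-block dependences explicit.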

This  paper  considers  generalized  selective  and  true 
dependence  shrinkings.1  It  is  shown  that  the  execution  order 
of  a  generalized  selective  shrinking  can  be  described  by  a 
linear schedule [13]; in addition, it is shown that true
dependence  shrinking  can  actually  be  described  by  a  conflict- 
free  mapping  from  higher  to  lower  dimensional  algorithms 
[13].  Methods  are  proposed  to  find  selective  and  true 
dependence  shrinkings  by  applying  the  techniques  proposed 
in  [13]  and  [15].  The  resulting  selective  shrinking  is  optimal 
for  algorithms  with  convex  polyhedron  index  sets,  while  the 
resulting  true  dependence  shrinking  is  optimal  for  algorithms 
with  doubly-nested  loops.  The  resulting  transforms 
outperform  those  proposed  in  [11]  both  in  terms  of  shortest 
execution  time  and  generality.  Also,  it  is  shown  that  selective 
shrinking  describes  wavefront  execution  [16],  [6]  and  the 
time-optimal  wavefront  execution  can  be  obtained  by  the 
method  discussed  in  this  paper  for  selective  shrinking. 

The paper is organized as follows. In Section 2, basic
notations, assumptions and ideas and the algorithm model
are introduced. Selective and true dependence shrinking are
described in Sections 3 and 4, respectively, where the
generalized definitions and motivations are given. Methods
to find optimal cycle shrinkings are also presented in these
two sections. Section 5 concludes this paper and points out
some future research.

2. Basic Ideas, Definitions and Motivations

Throughout this paper, sets, matrices and row vectors are
denoted by capital letters, column vectors are represented by
lower case symbols with an overbar and scalars correspond to
lower case letters. The transpose of a vector $\bar{v}$ is denoted $\bar{v}^T$.
The vector $\bar{0}$ ($\bar{1}$) denotes the row or column vector whose
entries are all zeroes (ones). The dimensions of vector $\bar{0}$ ($\bar{1}$)
and whether it denotes a row or column vector are implied
by the context in which they are used. $E_i$ denotes the row
vector whose entries are zero except that the $i$th entry is
unity. $I$ denotes the identity matrix. The rank of matrix $A$ is
denoted rank($A$). The set of integers, the set of non-negative
integers, the set of positive integers and the set of real
numbers are denoted $Z$, $N$, $N^+$ and $R$, respectively. The
empty set is denoted $\emptyset$. The notations $|C|$ and $|a|$ represent
the cardinality or the number of elements of set $C$ and the
absolute value of scalar $a$, respectively. Let $\bar{v}$ and $\bar{u}$ be two
vectors. Then $\bar{v} \ge \bar{u}$ means every component of $\bar{v}$ is greater
than or equal to the corresponding component of $\bar{u}$. Finally,
if $x$ is an element of a set $S$, the notation $x \in S$ is used, and
this notation is also used to indicate that a column vector
$\bar{m}_i$ (or row vector $M_i$) is a column (row) of a matrix $M$, i.e.,
$\bar{m}_i \in M$ ($M_i \in M$) means $\bar{m}_i$ ($M_i$) is a column (row) vector of
matrix $M$.

1 Both selective and true dependence shrinking always outperform
simple shrinking as indicated in [11]. Hence, we do
not consider simple shrinking in this paper.

This work was motivated by [11] and some of the results
reported in this paper are applications of work reported in
[13] and [15]. In [11], a special kind of Fortran-like nested
loop was considered with a dependence cycle among the
statements. This kind of loop has the following form:

DO $j_1 = l_1, u_1$
  DO $j_2 = l_2, u_2$
    ...
      DO $j_n = l_n, u_n$
        $S_1(\bar{j})$
        $S_2(\bar{j})$                    (2.1)
        ...
        $S_p(\bar{j})$
      END
    ...
  END
END

Column vector $\bar{j} = [j_1, j_2, \ldots, j_n]^T$ is the iteration vector.
$S_1(\bar{j}), S_2(\bar{j}), \ldots, S_p(\bar{j})$ are the $p$ statements in iteration $\bar{j}$.
Let $S_i\,\delta_{\bar{d}}\,S_j$ denote that statement $S_i$ depends on statement
$S_j$ with distance $\bar{d}$. In other words, $S_i(\bar{j})$ depends on
$S_j(\bar{j}-\bar{d})$ for all iteration vectors $\bar{j}$ and $\bar{j}-\bar{d}$. A dependence
cycle exists in the $p$ statements if a chain of such dependences,
with distances $\bar{d}_1, \ldots, \bar{d}_k$, leads from a statement back to itself.
Vectors $\bar{d}_r$, $r = 1, \ldots, k$, are assumed to be constant and non-zero.
The lower and upper bounds of the $i$th nested loop,
$1 \le i \le n$, are denoted by $l_i$ and $u_i$, respectively. These bounds
were assumed to be constant in [11]. In this paper, it is
assumed that $l_1$ and $u_1$ are constant and $l_i$, $u_i$ are linear
functions of indices $j_k$, $1 \le k \le i-1$, $i = 2, \ldots, n$. Algorithms
with $n$ nested loops are called $n$-dimensional algorithms. A
typical example [11] is in Figure 1 (Figure 6a in [11]) where
$S_1\,\delta_{\bar{d}_1}\,S_2\,\delta_{\bar{d}_2}\,S_1$ and $\bar{d}_1 = [3, 5]^T$ and $\bar{d}_2 = [2, 4]^T$. This algorithm
is 2-dimensional.

For the purposes of this paper, a pair $(J, D)$ can be used to characterize the algorithm model in (2.1). $J$ is the index set or iteration space. Each element in $J$ is an $n$-tuple column vector corresponding to one iteration of the loop. Often in this paper, a column vector in $J$ is called the index point or index vector. Matrix $D$ is the dependence matrix with $m$ columns each of which is a dependence vector $\bar d_i$, $i = 1, \ldots, m$. For example, the algorithm in Figure 1 can be described by $(J, D)$ where $J = \{\bar j : A\bar j \le \bar b,\ \bar j \in \mathbb{Z}^2\}$ and $A$, $\bar b$ and $D$ are given in Equation 2.2.


    DO j_1 = 3, u_1
      DO j_2 = 5, u_2
        S1: A(j_1, j_2) = B(j_1 - 3, j_2 - 5)
        S2: B(j_1, j_2) = A(j_1 - 2, j_2 - 4)
      END
    END

$S_1\,\delta_{\bar d_1}\,S_2\,\delta_{\bar d_2}\,S_1$, where $\bar d_1 = [3, 5]^T$ and $\bar d_2 = [2, 4]^T$.

Figure 1: Nested loops with cyclic dependence structure where $S_1$ depends on $S_2$ and vice versa.


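$A = \begin{bmatrix} -1 & 0 \\ 0 & -1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \bar b = \begin{bmatrix} -3 \\ -5 \\ u_1 \\ u_2 \end{bmatrix}, \quad D = [\bar d_1, \bar d_2] = \begin{bmatrix} 3 & 2 \\ 5 & 4 \end{bmatrix}$    (2.2)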

The pair $(J, D)$ only captures the structural information of cyclic dependence algorithms in (2.1). Finer-grained information, such as where and when input/output of variables take place, can be ignored for the purposes of this paper. Informally, an index vector in the index set $J$ corresponds to a computation of the algorithm which includes all $p$ statements. Computation $\bar j$ (the computation indexed by $\bar j$ is often simply described as "computation $\bar j$") depends on computation $\bar j - \bar d_i$, $i = 1, \ldots, m$ if $\bar j - \bar d_i \in J$. Clearly, given a nested loop like the one in (2.1), the pair $(J, D)$ can be constructed. Herein, it is assumed that the letters $n$ and $m$ always denote the number of nested loops (algorithm dimension) and the number of dependence vectors, respectively.

The  pair  (J,  D )  can  also  be  used  to  characterize  a  more 
general  class  of  algorithms  called  uniform  dependence 
algorithms  whose  detailed  definition  can  be  found  in  [13]. 
The  algorithm  model  in  (2.1)  can  be  transformed  into  a 
uniform  dependence  algorithm  and  all  results  and  techniques 
presented  in  this  paper  apply  to  both  uniform  dependence 
algorithms  and  the  algorithm  model  in  (2.1).  Uniform 
dependence  algorithms  occur  frequently  in  signal  processing 
and  scientific  computing  applications.  Examples  include 
matrix  multiplication,  LU  decomposition,  and  convolution. 

In this paper, only the bounds of the outermost loop need to be constant; the upper and lower bounds of the inner loops can be functions of outer loop indices. This paper considers convex polyhedron index sets. Examples of such index sets include trapezoidal, square and triangular index sets. For 2-dimensional algorithms, these index sets are shown in Figure 2. The convex hulls [1, p. 35] of these index sets can be described by

$R = \{\bar x : A\bar x \le \bar b,\ A \in \mathbb{Z}^{a \times n},\ \bar b \in \mathbb{Z}^a,\ \bar x \in \mathbb{R}^n,\ a \in \mathbb{N}^+\}$    (2.3)

and

$J = \{\bar j : \bar j \in R,\ \bar j \in \mathbb{Z}^n\}$    (2.4)

where $A$ is called the index constraint matrix of $J$; vector $\bar b$ is called the size vector; $R$ is called the convex hull of index set $J$. Since $R$ is the convex hull of index set $J$, extreme points of polyhedron $R$ are always integers and belong to the index set $J$. Finally, different size vectors correspond to instances of the same algorithm that differ only in their sizes (but have the same shape).
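As a small illustration of (2.3) and (2.4), the sketch below tests membership in an index set given by an index constraint matrix and a size vector, and enumerates the integer points inside a search box. The function names and the `box` argument are hypothetical helpers introduced only for this example; the sizes $u_1 = 4$, $u_2 = 6$ are arbitrary.

```python
from itertools import product

def in_index_set(A, b, j):
    """True if the integer point j satisfies A j <= b row by row."""
    return all(sum(a * x for a, x in zip(row, j)) <= bound
               for row, bound in zip(A, b))

def enumerate_index_set(A, b, box):
    """List the integer points of {j : A j <= b} inside a search box.

    box is a list of (low, high) pairs, one per dimension (an assumption
    of this sketch; it only limits the search region).
    """
    ranges = [range(lo, hi + 1) for lo, hi in box]
    return [j for j in product(*ranges) if in_index_set(A, b, j)]

# Index set of Figure 1 (Equation 2.2) for the assumed sizes u1 = 4, u2 = 6.
u1, u2 = 4, 6
A = [[-1, 0], [0, -1], [1, 0], [0, 1]]
b = [-3, -5, u1, u2]
points = enumerate_index_set(A, b, [(0, 10), (0, 10)])
```

Changing only `u1` and `u2` in the size vector yields other instances of the same algorithm with the same rectangular shape.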




Figure 2: 2-dimensional convex polyhedra.

Figure 3a shows the index set $J$ of the algorithm in Figure 1. Each point corresponds to a computation or an index point. Dependence relations exist among different computations or points. The convex hull of this index set is simply a rectangle which is described by $R = \{\bar x : A\bar x \le \bar b,\ \bar x \in \mathbb{R}^2\}$. Matrix $A$ and the size vector $\bar b$ are defined in Equation 2.2. Different values of $u_1$ and $u_2$ in the size vector $\bar b$ give different instances of the algorithm.


(a) The index set of the algorithm in Figure 1.

(b) New 1-dimensional index set after true dependence shrinking. Marked positions: $2(u_2 - 4) + 4$, $3(u_2 - 4) + 5$ and $(u_1 - 2)(u_2 - 4)$.

Figure 3: True dependence shrinking.


3.  Generalized  Selective  Shrinking 

In this section, selective cycle shrinking, introduced in [11], is described informally in order to motivate generalized selective shrinking, which employs a linear transformation to order algorithm execution. A method for determining optimal generalized selective shrinking is presented, as well as techniques for transforming the original source code according to optimal shrinking.

3.1.  Selective  shrinking  and  generalized  selective 
shrinking 

For an $n$-dimensional algorithm $(J, D)$, let the dependence matrix $D = [d_{ij}]$, $i = 1, \ldots, n$ and $j = 1, \ldots, m$, and $\delta_i = \min\{d_{ij} : j = 1, \ldots, m\}$, $i = 1, \ldots, n$. According to [11], selective shrinking starts from the outermost loop, i.e., at $i = 1$, and considers each $\delta_i$, $i = 1, \ldots, n$ in turn. When the first positive number $\delta_i$ is met, the process stops. All loops nested inside the $i$th loop are transformed into parallel DOALL loops, which means all iterations inside the loop can be executed in parallel. The $i - 1$ outermost loops stay unchanged and the $i$th loop is sequential with a loop increment $\delta_i$.
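The selection rule just described can be sketched as follows. This is a minimal illustration of the rule as stated in the paragraph above, not code from [11]; the function name is hypothetical, and the dependence matrix is passed as a list of rows.

```python
from typing import List, Optional, Tuple

def selective_shrinking_level(D: List[List[int]]) -> Optional[Tuple[int, int]]:
    """Return (i, delta_i) for the first loop level i (1-based) whose minimum
    dependence distance delta_i = min_j d_ij is positive, or None if no
    level has a positive minimum."""
    for i, row in enumerate(D, start=1):
        delta = min(row)
        if delta > 0:
            return i, delta
    return None

# Dependence matrix of Figure 1: columns are d_1 = [3, 5]^T and d_2 = [2, 4]^T.
D_fig1 = [[3, 2],
          [5, 4]]
level = selective_shrinking_level(D_fig1)

# A single dependence vector d = [0, 1]^T: the first positive delta is found
# only at the innermost loop, so no loop is left to run as a DOALL.
level_inner = selective_shrinking_level([[0], [1]])
```

For Figure 1 the rule stops at the outermost loop with increment 2, so the inner loop becomes a DOALL loop.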

For example, consider the algorithm in Figure 1 where $D = \begin{bmatrix} 3 & 2 \\ 5 & 4 \end{bmatrix}$ and index set $J$ is shown in Figure 3a. This algorithm has a dependence cycle because statement $S_1$ depends on $S_2$ and $S_2$ depends on $S_1$. Clearly, $\delta_1 = \min\{3, 2\} = 2$ and $\delta_2 = \min\{5, 4\} = 4$. Because $\delta_1 = 2 > 0$, the outermost loop is blocked and is executed sequentially with a loop increment $\delta_1 = 2$ and the innermost loop becomes a parallel loop DOALL. The parallel code transformed by the selective shrinking is shown in Figure 4a. With this parallel code, if there are enough processors available, index points $[3, j_2]^T$, $[4, j_2]^T$, $j_2 = 5, \ldots, u_2$ can be executed in parallel at time step 0, and index points $[5, j_2]^T$, $[6, j_2]^T$, $j_2 = 5, \ldots, u_2$ can be executed in parallel at time step 1, etc. The reduction factor is $2(u_2 - 4)$, which equals the number of computations executed in parallel at one time step. Pictorially, if a set of hyperplanes (lines in this two-dimensional algorithm) $[1, 0]\bar j = c$, $c = 3, \ldots, u_1$ are drawn in the index set as shown in Figure 4b, then all the points lying on two consecutive hyperplanes $[1, 0]\bar j = c$ and $[1, 0]\bar j = c + 1$ (e.g., $[1, 0]\bar j = 3$ and $[1, 0]\bar j = 4$) can be executed in parallel according to the selective shrinking. This execution order can be described by mapping $\sigma_{[1,0]}(\bar j) = \lfloor ([1, 0]\bar j - 3)/2 \rfloor$ such that computation $\bar j$ is executed at time step $\sigma_{[1,0]}(\bar j)$. Computations indexed by vectors $\bar j_1$ and $\bar j_2$ can be executed in parallel according to the code in Figure 4a if and only if $\sigma_{[1,0]}(\bar j_1) = \sigma_{[1,0]}(\bar j_2)$. Clearly, this execution order respects the dependence relations. For example, computation $[5, 9]^T$ depends on computation $[3, 5]^T$ by dependence vector $\bar d_2$. This dependence relation is respected by selective shrinking because computation $[3, 5]^T$ is scheduled to execute (at $t = 0$) before $[5, 9]^T$ (at time $t = 1$). In fact, the dependence relation is respected because the inequality $[1, 0]D > 0$ is satisfied.
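The claim that this execution order respects the dependence relations can be checked by brute force for particular sizes. The sketch below does so for the mapping $\sigma_{[1,0]}$ of the Figure 1 example; the helper names and the sizes used are assumptions of this illustration.

```python
LOWER = (3, 5)  # lower loop bounds of the Figure 1 example

def sigma_10(j):
    """Time step of index point j under selective shrinking of Figure 1."""
    return (j[0] - 3) // 2

def schedule_respects_dependences(u1, u2, deps):
    """Check that every in-range predecessor j - d runs strictly earlier than j."""
    for j1 in range(LOWER[0], u1 + 1):
        for j2 in range(LOWER[1], u2 + 1):
            for d in deps:
                pred = (j1 - d[0], j2 - d[1])
                inside = (LOWER[0] <= pred[0] <= u1 and
                          LOWER[1] <= pred[1] <= u2)
                if inside and sigma_10(pred) >= sigma_10((j1, j2)):
                    return False
    return True

ok = schedule_respects_dependences(12, 15, [(3, 5), (2, 4)])
# A hypothetical dependence vector [1, 0]^T is shorter than the loop
# increment 2, so the same blocking would violate it.
bad = schedule_respects_dependences(12, 15, [(1, 0)])
```

The second call shows why the loop increment must not exceed the smallest dependence distance along the blocked loop.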

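At this point, some interesting observations can be made. First, there are other possible mappings $\sigma$ by which the total execution time may be shorter than with the mapping $\sigma_{[1,0]}$. The total execution time by $\sigma_{[1,0]}$ is

$\left\lceil \dfrac{[1, 0]([u_1, 5]^T - [3, 5]^T) + 1}{2} \right\rceil = \left\lceil \dfrac{u_1 - 2}{2} \right\rceil.$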



1991  International  Conference  on  Parallel  Processing 


    DO j_1 = 3, u_1, 2
      DOALL j_1' = j_1, j_1 + 1
        DOALL j_2 = 5, u_2
          S1: A(j_1', j_2) = B(j_1' - 3, j_2 - 5)
          S2: B(j_1', j_2) = A(j_1' - 2, j_2 - 4)
        ENDOALL
      ENDOALL
    ENDO

(a). Parallel code after selective shrinking.

(b). Pictorial description of the execution order of selective shrinking (hyperplanes $[1, 0]\bar j = 3, 4, 5, \ldots$ drawn across the index set).

Figure 4: Selective shrinking.


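However, if mapping $\sigma_{[0,1]}(\bar j) = \lfloor ([0, 1]\bar j - 5)/4 \rfloor$ is used, then the total execution time is

$\left\lceil \dfrac{[0, 1]([3, u_2]^T - [3, 5]^T) + 1}{4} \right\rceil = \left\lceil \dfrac{u_2 - 4}{4} \right\rceil.$

Note that mapping $\sigma_{[0,1]}$ also respects the dependence relation because $[0, 1]D > 0$ [13]. Clearly, when $u_2 < 2u_1$, the total execution time of mapping $\sigma_{[0,1]}$ is shorter than that of the mapping $\sigma_{[1,0]}$ found using the original selective shrinking technique^1 [11].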

A second observation can be made from the example algorithm given in Figure 5, with dependence vector $\bar d = [0, 1]^T$. In this case, the original selective shrinking technique fails to obtain any speedup because mapping $\sigma_{[1,0]}$, which describes the execution order of the algorithm after selective shrinking, does not respect the dependence relation ($[1, 0]\bar d = 0$) [13]. However, other feasible mappings with significant reduction factors exist for this algorithm. One possible mapping is $\sigma_{[0,1]}(\bar j) = [0, 1]\bar j$, whose reduction factor equals the number of index points on each hyperplane $[0, 1]\bar j = c$. The execution order determined by mapping $\sigma_{[0,1]}$ is pictorially shown in Figure 5 where the execution wavefront is described by dashed hyperplane $[0, 1]\bar j = c$ and all computations on the same hyperplane $[0, 1]\bar j = c$ are executed


1 If the values of $u_1$ or $u_2$ are not known until run-time, this analysis could be used to generate a run-time test to choose the optimal mapping at run-time based on the relative values of $u_1$ and $u_2$. This concept is similar to that proposed in [3] for multiple version loops.


in  parallel. 


Figure 5: An example in which the selective and true dependence shrinkings do not get speedup or are not feasible.


These two observations indicate that selective shrinking limits the full exploitation of the parallelism in nested loop structures, and they motivate the definition of generalized selective shrinking as follows.

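Definition 3.1 (generalized selective shrinking): For algorithm $(J, D)$ a generalized selective shrinking is a mapping $\sigma_\Pi : J \to \mathbb{N}$ such that

$\sigma_\Pi(\bar j) = \lfloor (\Pi\bar j + c)/\mathrm{disp}\,\Pi \rfloor, \quad \bar j \in J$    (3.1)

where $\Pi \in \mathbb{Z}^{1 \times n}$, $\mathrm{disp}\,\Pi = \min\{\Pi\bar d_i : \bar d_i \in D\} > 0$, $\gcd(\pi_1, \ldots, \pi_n) = 1$ and $c = -\min\{\Pi\bar j : \bar j \in J\}$. The row vector $\Pi$ is called the schedule vector specifying $\sigma_\Pi$.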
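Definition 3.1 translates almost directly into code. The sketch below builds $\sigma_\Pi$ for an explicitly enumerated index set; it is an illustration under that assumption, with hypothetical helper names, and it does not check the condition $\gcd(\pi_1, \ldots, \pi_n) = 1$.

```python
def dot(a, b):
    """Inner product of two equal-length integer sequences."""
    return sum(x * y for x, y in zip(a, b))

def make_schedule(pi, index_set, deps):
    """Return (sigma, disp) for schedule vector pi as in Definition 3.1.

    index_set: iterable of index points (tuples); deps: dependence vectors.
    Raises ValueError when disp = min(pi . d_i) is not positive.
    """
    disp = min(dot(pi, d) for d in deps)
    if disp <= 0:
        raise ValueError("infeasible schedule vector: Pi D > 0 does not hold")
    c = -min(dot(pi, j) for j in index_set)

    def sigma(j):
        return (dot(pi, j) + c) // disp

    return sigma, disp

# Figure 1 example with assumed sizes u1 = 8, u2 = 13.
index_set = [(j1, j2) for j1 in range(3, 9) for j2 in range(5, 14)]
deps = [(3, 5), (2, 4)]
sigma_01, disp_01 = make_schedule((0, 1), index_set, deps)
```

With the single dependence vector $[0, 1]^T$ of Figure 5, calling `make_schedule((1, 0), ...)` raises the error, which mirrors the observation that $[1, 0]\bar d = 0$ is not feasible.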


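Generalized selective shrinking as defined in Definition 3.1 respects the dependence relations. This means computation $\bar j$ is executed only after the execution of computations $\bar j - \bar d_i$, $i = 1, \ldots, m$, upon which computation $\bar j$ depends. This follows from the constraint $\Pi D > 0$. For any algorithm $(J, D)$, the total execution time by the generalized selective shrinking $\sigma_\Pi$ is as follows:

$t = \left\lceil \dfrac{\max\{\Pi(\bar j_1 - \bar j_2) : \bar j_1, \bar j_2 \in J\} + 1}{\mathrm{disp}\,\Pi} \right\rceil = \left\lfloor \dfrac{\max\{\Pi(\bar j_1 - \bar j_2) : \bar j_1, \bar j_2 \in J\}}{\mathrm{disp}\,\Pi} \right\rfloor + 1$    (3.2)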

Because total execution time $t$ is minimized iff the part inside the floor function is minimized, the problem of finding optimal generalized shrinking can be formulated as follows:

$\min f = \dfrac{\max\{\Pi(\bar j_1 - \bar j_2) : \bar j_1, \bar j_2 \in J\}}{\min\{\Pi\bar d_i : \bar d_i \in D\}}$    (3.3)

subject to
(1) $\Pi D > 0$
(2) $\gcd(\pi_1, \ldots, \pi_n) = 1$
The optimal solution of the above problem is denoted $\Pi^*$. Notice that $\Pi^*$ is optimal only for the kind of schedule defined in (3.1). It is possible that there exist other schedules


2 $\gcd(\pi_1, \ldots, \pi_n)$ is the greatest common divisor of $\pi_1, \ldots, \pi_n$.




with shorter execution time than $\Pi^*$. In [14], a class of algorithms is identified where the optimal linear schedule is optimal for all kinds of schedules. The schedule vector describing the execution order of the selective shrinking defined in [11] for algorithms where $\delta_1 = \min\{d_{11}, \ldots, d_{1m}\} > 0$ is $\Pi = [1, 0, \ldots, 0]$. For the algorithm in Figure 4, the schedule vector for the selective shrinking is $\Pi = [1, 0]$. Because $[1, 0][\bar d_1, \bar d_2] = [3, 2]$, this schedule vector is feasible and respects the dependence relation. However, it is not necessarily the optimal one.


3.2.  Finding  optimal  generalized  selective  shrinking 

In [13], a procedure is proposed to find the optimal schedule vector $\Pi^*$ for any uniform dependence algorithm with a convex polyhedron index set. To avoid the further complicated terminology and definitions that are needed for a formal presentation of the procedure for any uniform dependence algorithm, only the procedure for uniform dependence algorithms where the loop bounds are constant and the procedure for 2-dimensional algorithms with any convex polyhedron index sets are presented. The procedure for $n$-dimensional algorithms with any polyhedron index sets and the proofs of the correctness of these procedures can be found in [13].

An index set is called constant-bounded if all loop bounds of the associated algorithm are constant, i.e.,

$J = \{[j_1, \ldots, j_n]^T : l_i \le j_i \le u_i,\ j_i \in \mathbb{Z},\ i = 1, \ldots, n\}.$    (3.4)


Let $D(c_1 \ldots c_k / r_1 \ldots r_k)$ denote the submatrix of $D$ containing the elements in columns $c_1, \ldots, c_k$ and rows $r_1, \ldots, r_k$, i.e., it contains the elements of $D$ at the intersections of columns $c_1, \ldots, c_k$ and rows $r_1, \ldots, r_k$. If $D(c_1 \ldots c_k / r_1 \ldots r_k)$ is nonsingular, an integer row vector $V = [v_1, \ldots, v_k] \in \mathbb{Z}^{1 \times k}$ is defined as $V = d\,\mathbf{1}\,D^{-1}(c_1 \ldots c_k / r_1 \ldots r_k)$ where $\mathbf{1}$ is the row vector of $k$ ones and $d$ is a positive integer such that $\gcd(v_1, \ldots, v_k) = 1$. In other words, $V$ is a vector whose entries are the sums of the corresponding columns of $D^{-1}(c_1 \ldots c_k / r_1 \ldots r_k)$ scaled so that they are integers with the greatest common divisor equal to unity. If $D(c_1 \ldots c_k / r_1 \ldots r_k)$ is nonsingular, then define

$\Pi(c_1 \ldots c_k / r_1 \ldots r_k) = VB$, where $B = \begin{bmatrix} E_{r_1} \\ \vdots \\ E_{r_k} \end{bmatrix}$

and $E_i$ is as defined in the first paragraph of Section 2. In words, the subvector $[\pi_{r_1}, \ldots, \pi_{r_k}]$ of $\Pi(c_1 \ldots c_k / r_1 \ldots r_k)$ is the same as $V$ and the remaining entries of $\Pi(c_1 \ldots c_k / r_1 \ldots r_k)$ are zero. Finally, $C$ denotes the set $\{\Pi(c_1 \ldots c_k / r_1 \ldots r_k) : 1 \le k \le \mathrm{rank}(D),\ 1 \le c_1 < \cdots < c_k \le m,\ 1 \le r_1 < \cdots < r_k \le n\}$. The following example illustrates the notations and concepts just introduced, followed by a theorem which states that the optimal solution $\Pi^*$ is in $C$ and a procedure which constructs the candidate set $C$.


Example 3.1: Consider the algorithm $(J, D)$ in Figure 1 where dependence matrix $D$ is as shown in Equation 2.2. According to the definitions just introduced, $D(1/1) = [3]$ (contains the entry in the first column and the first row of $D$), its corresponding $V = d\,\mathbf{1}\,D^{-1}(1/1) = [1]$ where $d = 3$, $B = E_1 = [1, 0]$ and $\Pi(1/1) = VB = [1, 0]$; $D(1/2) = [5]$, its corresponding $V = d\,\mathbf{1}\,D^{-1}(1/2) = [1]$ where $d = 5$, $B = E_2 = [0, 1]$ and $\Pi(1/2) = VB = [0, 1]$; and $D(12/12) = D$, its corresponding $V = d\,\mathbf{1}\,D^{-1} = [-1, 1]$ where $d = 2$, $B = I$ and $\Pi(12/12) = VB = V = [-1, 1]$. The candidate set is $C = \{[1, 0], [0, 1], [-1, 1]\}$. □


Theorem 3.1 [13]: If the index set $J$ of algorithm $(J, D)$ is constant-bounded as defined by (3.4), then the optimal solution $\Pi^*$ to problem (3.3) belongs to $C$, i.e.,

$\Pi^* \in C = \{\Pi(c_1 \ldots c_k / r_1 \ldots r_k) : 1 \le k \le \mathrm{rank}(D),\ 1 \le c_1 < \cdots < c_k \le m,\ 1 \le r_1 < \cdots < r_k \le n\}$    (3.5)


Proof:  See  [13]. 


Procedure 3.1 (construction of candidate set $C$):

Input: Algorithm $(J, D)$ where $J$ is constant-bounded.
Output: A finite candidate set $C$ containing the optimal solution $\Pi^*$.

Step 1: $k = 1$, $l = 1$, $C = \emptyset$.

Step 2: $C_k = \emptyset$.

Step 3: Pick an unprocessed combination of $k$ elements from $\{1, \ldots, n\}$. Denote it $\{r_1, \ldots, r_k\}$ where $r_1 < \cdots < r_k$.

Step 4: Pick an unprocessed combination of $k$ elements from $\{1, \ldots, m\}$. Denote it $\{c_1, \ldots, c_k\}$ where $c_1 < \cdots < c_k$.

Step 5: If $D(c_1 \ldots c_k / r_1 \ldots r_k)$ is not singular, then $\Pi_l = \Pi(c_1 \ldots c_k / r_1 \ldots r_k)$ and $C_k = C_k \cup \{\Pi_l\}$, $l = l + 1$.

Step 6: Check if all distinct combinations of $k$ elements from $\{1, \ldots, m\}$ have been processed. If not, go to Step 4.

Step 7: Check if all distinct combinations of $k$ elements from $\{1, \ldots, n\}$ have been processed. If not, go to Step 3.

Step 8: If $k < \mathrm{rank}(D)$, then $k = k + 1$, go to Step 2.

Step 9: $C = \bigcup_{k=1}^{\mathrm{rank}(D)} C_k$. Stop.
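The steps of Procedure 3.1 can be sketched in a few lines using exact rational arithmetic. The code below is an illustration, not the implementation of [13]: the helper names are hypothetical, $D$ is passed as a list of rows, and the loop over $k$ runs up to $\min(n, m)$ instead of $\mathrm{rank}(D)$, which gives the same result because singular submatrices are skipped.

```python
from fractions import Fraction
from functools import reduce
from itertools import combinations
from math import gcd

def solve(M, rhs):
    """Solve M x = rhs exactly by Gauss-Jordan; return None if M is singular."""
    k = len(M)
    aug = [[Fraction(v) for v in row] + [Fraction(r)] for row, r in zip(M, rhs)]
    for col in range(k):
        pivot = next((r for r in range(col, k) if aug[r][col] != 0), None)
        if pivot is None:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [v / aug[col][col] for v in aug[col]]
        for r in range(k):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]

def primitive_integer_vector(x):
    """Scale a rational vector by a positive factor to coprime integers."""
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (f.denominator for f in x), 1)
    ints = [int(f * scale) for f in x]
    g = reduce(gcd, (abs(v) for v in ints), 0)
    return [v // g for v in ints]

def candidate_set(D):
    """Candidate schedule vectors Pi(c_1...c_k / r_1...r_k), duplicates removed."""
    n, m = len(D), len(D[0])
    candidates = []
    for k in range(1, min(n, m) + 1):
        for rows in combinations(range(n), k):        # Step 3
            for cols in combinations(range(m), k):    # Step 4
                # V * sub = [1, ..., 1] is solved as sub^T * V^T = [1, ..., 1]^T.
                sub_T = [[D[r][c] for r in rows] for c in cols]
                x = solve(sub_T, [1] * k)
                if x is None:                          # singular submatrix
                    continue
                v = primitive_integer_vector(x)
                pi = [0] * n
                for r, value in zip(rows, v):
                    pi[r] = value
                if tuple(pi) not in candidates:        # Step 5
                    candidates.append(tuple(pi))
    return candidates

C = candidate_set([[3, 2], [5, 4]])
```

For the dependence matrix of Figure 1 this reproduces the candidate set of Example 3.1.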

The complexity of Procedure 3.1 is bounded above by $O\bigl(\sum_{k=1}^{\mathrm{rank}(D)} \binom{n}{k}\binom{m}{k} k^3\bigr)$. If $\mathrm{rank}(D) = n = m$, then $\sum_{k=1}^{n} \binom{n}{k}\binom{m}{k} = \sum_{k=1}^{n} \binom{n}{k}^2 \le \bigl(\sum_{k=1}^{n} \binom{n}{k}\bigr)^2 < (2^n)^2 = 2^{2n}$ and the complexity of Procedure 3.1 is bounded above by $O(2^{2n} n^3)$. As indicated in [13], this upper bound on the complexity is loose and in practical algorithms, $n$ is small, e.g., $n = 2$ for the algorithm in Figure 1. Next, an example is used to illustrate Procedure 3.1.


Example 3.2: Consider the algorithm in Figure 1. According to Steps 1 and 2 in Procedure 3.1, let $k = 1$, $l = 1$ and $C_1 = \emptyset$. For Steps 3 and 4, there are two possible combinations of one element ($k = 1$) from $\{1, \ldots, n\}$ and two possible combinations of one element from $\{1, \ldots, m\}$ ($n = m = 2$). So in Step 5, $D(1/1)$, $D(1/2)$, $D(2/1)$ and $D(2/2)$ are processed. Each corresponding vector is obtained as follows. $D(1/1) = [3]$ and $\Pi(1/1) = [1, 0]$; $D(2/1) = [2]$ and $\Pi(2/1) = [1, 0]$; $D(1/2) = [5]$ and $\Pi(1/2) = [0, 1]$; $D(2/2) = [4]$ and $\Pi(2/2) = [0, 1]$. $C_1 = \{[1, 0], [0, 1]\}$. For $k = 2$, there is only one combination, i.e., $D(12/12) = D$ and $\Pi(12/12) = [-1, 1]$. $C_2 = \{[-1, 1]\}$. So $C = \{[1, 0], [0, 1], [-1, 1]\} = \{\Pi_1, \Pi_2, \Pi_3\}$. Every vector in $C$ is feasible since $\Pi_1 D = [3, 2]$, $\Pi_2 D = [5, 4]$, and $\Pi_3 D = [2, 2]$. The total execution time by each candidate is as follows.

$t(\Pi_1) = \left\lceil \dfrac{[1, 0]([u_1, 5]^T - [3, 5]^T) + 1}{2} \right\rceil = \left\lceil \dfrac{u_1 - 2}{2} \right\rceil$

$t(\Pi_2) = \left\lceil \dfrac{[0, 1]([3, u_2]^T - [3, 5]^T) + 1}{4} \right\rceil = \left\lceil \dfrac{u_2 - 4}{4} \right\rceil$

$t(\Pi_3) = \left\lceil \dfrac{[-1, 1]([3, u_2]^T - [u_1, 5]^T) + 1}{2} \right\rceil = \left\lceil \dfrac{u_1 + u_2 - 7}{2} \right\rceil$

With little thought, it can be seen that no matter what values $u_1$ and $u_2$ take, $\Pi_1$ and $\Pi_2$ have shorter execution time than $\Pi_3$. When $u_2 < 2u_1$, vector $\Pi_2$ has the minimum execution time and should be the optimal schedule vector, and if $u_2 > 2u_1$, $\Pi_1$ should be the optimal schedule vector. □
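The comparison in Example 3.2 is the kind of test that could be evaluated once the loop bounds are known. The sketch below uses the closed-form execution times of the three candidates for the Figure 1 algorithm; the function names are hypothetical and the code is only an illustration of such a run-time selection, assuming $u_1 \ge 3$ and $u_2 \ge 5$.

```python
def ceil_div(a, b):
    """Ceiling of a / b for integers with b > 0."""
    return -(-a // b)

def execution_times(u1, u2):
    """Total execution time of each candidate schedule vector of Example 3.2."""
    return {
        (1, 0): ceil_div(u1 - 2, 2),
        (0, 1): ceil_div(u2 - 4, 4),
        (-1, 1): ceil_div(u1 + u2 - 7, 2),
    }

def choose_schedule(u1, u2):
    """Pick a candidate with minimum execution time (first one wins ties)."""
    times = execution_times(u1, u2)
    return min(times, key=times.get)

best_wide = choose_schedule(10, 13)   # u2 < 2 * u1
best_tall = choose_schedule(5, 40)    # u2 > 2 * u1
```

Because of the ceilings, the two main candidates can tie for some sizes; the sketch then keeps the first one listed.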

For 2-dimensional algorithms with convex polyhedron index sets (examples of such index sets are shown in Figure 2), let $C'$ denote the candidate set. According to [13], $C'$ contains all vectors $\Pi$ which (1) are perpendicular to one of the boundary lines of the index set or (2) satisfy equation $\Pi(\bar d_i - \bar d_k) = 0$ where $\bar d_i$ and $\bar d_k$ are two linearly independent dependence vectors of the algorithm. As defined in Equation 2.3, let $A_i$ and $b_i$ be a row of the index constraint matrix $A$ and an entry of the size vector $\bar b$, respectively; then $A_i \bar j = b_i$ defines a boundary line of the index set $J$. Vectors which are perpendicular to that boundary line are described by $\alpha A_i$ where $\alpha$ is a non-zero constant. Thus, $C'$ contains those rows of $A$ each of which is linearly independent of any other rows of $A$ in $C'$. For example, if $A$ has two rows $[1, 0]$ and $[-1, 0]$, then only one of them is included in $C'$ because they really correspond to the same schedule vector. The following procedure summarizes how to construct the candidate set $C'$ of 2-dimensional algorithms with polyhedron index sets.

Procedure 3.2 (construction of $C'$ for 2-dimensional algorithms):

Input: Index constraint matrix $A$ and dependence matrix $D$.

Output: A finite candidate set $C'$ containing the optimal solution $\Pi^*$.

Step 1: $k = 1$, $l = 1$, $C' = \emptyset$.

Step 2: If $A_k$ satisfies the following two conditions: (1) it is linearly independent of any other vectors in $C'$ and (2) either $A_k D > 0$ or $-A_k D > 0$, then either $\Pi_l = \alpha A_k$ (if $A_k D > 0$) or $\Pi_l = -\alpha A_k$ (if $-A_k D > 0$), $C' = C' \cup \{\Pi_l\}$, where $\alpha$ is such that the greatest common divisor of entries of $\Pi_l$ is unity and $\Pi_l$ is integral, and $l = l + 1$.

Step 3: $k = k + 1$. If not all rows of $A$ have been considered, go to Step 2.

Step 4: Pick an unprocessed combination of two linearly independent dependence vectors $\bar d_i$ and $\bar d_k$ from $D$. Let $\Pi'$ be the integral solution of equation $\Pi(\bar d_i - \bar d_k) = 0$ such that the greatest common divisor of its entries is unity. If $\Pi'$ is feasible (either $\Pi' D > 0$ or $-\Pi' D > 0$), then $\Pi_l = \Pi'$, $C' = C' \cup \{\Pi_l\}$, and $l = l + 1$.

Step 5: If all distinct combinations of two linearly independent dependence vectors are processed, stop. Otherwise, go to Step 4.
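A compact sketch of Procedure 3.2 for the 2-dimensional case is given below. It is an illustration only: the helper names are hypothetical, rows of $A$ and dependence vectors are passed as nonzero integer pairs, and in Step 4 the sketch stores whichever of $\Pi'$ and $-\Pi'$ satisfies $\Pi D > 0$, which is an assumption about how the sign is resolved.

```python
from itertools import combinations
from math import gcd

def primitive(v):
    """Divide a nonzero integer pair by the gcd of its entries."""
    g = gcd(abs(v[0]), abs(v[1]))
    return (v[0] // g, v[1] // g)

def oriented(v, deps):
    """Return v or -v, whichever has positive product with every dependence
    vector, or None if neither does."""
    prods = [v[0] * d[0] + v[1] * d[1] for d in deps]
    if all(p > 0 for p in prods):
        return v
    if all(p < 0 for p in prods):
        return (-v[0], -v[1])
    return None

def independent(v, w):
    """Two planar vectors are linearly independent iff their determinant is nonzero."""
    return v[0] * w[1] - v[1] * w[0] != 0

def candidate_set_2d(A_rows, deps):
    C = []
    for row in A_rows:                                   # Steps 1 to 3
        v = primitive(row)
        if any(not independent(v, c) for c in C):
            continue
        pi = oriented(v, deps)
        if pi is not None:
            C.append(pi)
    for di, dk in combinations(deps, 2):                 # Steps 4 and 5
        if not independent(di, dk):
            continue
        diff = (di[0] - dk[0], di[1] - dk[1])
        pi = oriented(primitive((-diff[1], diff[0])), deps)
        if pi is not None and pi not in C:
            C.append(pi)
    return C

A_rows = [(-1, 0), (0, -1), (1, 0), (0, 1)]   # Equation 2.2
deps = [(3, 5), (2, 4)]
C_prime = candidate_set_2d(A_rows, deps)
```

For the rectangular index set of Figure 1 the result coincides with the candidate set $C$ of Example 3.1.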

The procedure for finding the optimal solution $\Pi^*$ for $n$-dimensional algorithms and the formal proof of the correctness of these procedures can be found in [13]. The execution of the generalized selective shrinking can be coded using parallel constructs, which is discussed next.

3.3.  Code  Conversion 

After the optimal schedule vector $\Pi^*$ is found, the parallel version of the code, using DOALL or DOACROSS [3] constructs, can be generated from the original serial code as explained next.

Let the original serial program be as in (2.1), the optimal schedule vector $\Pi^* = [\pi_1, \pi_2, \ldots, \pi_n]$ and $\mathrm{disp}\,\Pi^*$ be as defined in Definition 3.1. Then the parallel code using the DOALL construct is as follows:

    l = min{Π* j : j ∈ J}
    u = max{Π* j : j ∈ J}
    DO k = l, u, dispΠ*
      DOALL (j : Π* j = k, k+1, ..., k + dispΠ* - 1)
        S_1(j)
        ...
        S_p(j)
      END
    END

The statement DOALL ($\bar j$ : $\Pi^*\bar j = k, k+1, \ldots, k + \mathrm{disp}\,\Pi^* - 1$) corresponds to the parallel execution of all iterations $\bar j$ such that $\Pi^*\bar j = k$, $k + 1$, $\ldots$, or $k + \mathrm{disp}\,\Pi^* - 1$. For the algorithm in Figure 1, if $u_2 < 2u_1$, the optimal schedule vector is $\Pi^* = [0, 1]$ and $\mathrm{disp}\,\Pi^* = \min\{\Pi^*\bar d_i : \bar d_i \in D\} = \Pi^*\bar d_2 = 4$. The serial code in Figure 1 can be transformed into the following parallel code.

    l = min{Π* j : j ∈ J} = [0, 1][3, 5]^T = 5
    u = max{Π* j : j ∈ J} = [0, 1][3, u_2]^T = u_2

    DO k = 5, u_2, 4
      DOALL (j : Π* j = k, k+1, k+2, k+3)
        S1: A(j_1, j_2) = B(j_1 - 3, j_2 - 5)
        S2: B(j_1, j_2) = A(j_1 - 2, j_2 - 4)
      END
    END
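The DOALL form above can be simulated sequentially by grouping index points into blocks of $\mathrm{disp}\,\Pi^*$ consecutive hyperplanes. The sketch below is such a simulation for an explicitly enumerated index set; the names are hypothetical and the sizes $u_1 = 8$, $u_2 = 14$ are arbitrary.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def doall_blocks(pi, index_set, disp):
    """Yield, for k = l, l + disp, ..., the index points j with
    k <= pi . j <= k + disp - 1; each yielded list is one parallel step."""
    values = [dot(pi, j) for j in index_set]
    low, high = min(values), max(values)
    for k in range(low, high + 1, disp):
        yield [j for j in index_set if k <= dot(pi, j) <= k + disp - 1]

u1, u2 = 8, 14
index_set = [(j1, j2) for j1 in range(3, u1 + 1) for j2 in range(5, u2 + 1)]
deps = [(3, 5), (2, 4)]
blocks = list(doall_blocks((0, 1), index_set, 4))
block_sizes = [len(block) for block in blocks]
```

Each inner list holds the iterations that one step of the outer DO loop would run as a DOALL; the number of blocks equals the total execution time of the schedule.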

The DOACROSS constructs can also be used to express the parallel code of the algorithm in (2.1) with optimal schedule vector $\Pi^* = [\pi_1, \pi_2, \ldots, \pi_n]$.

    DO j_1 = l_1, u_1
    delay ((j_1 - 1) * π_1)
      DO j_2 = l_2, u_2
      delay ((j_2 - 1) * π_2)
        ...
          DO j_n = l_n, u_n
          delay ((j_n - 1) * π_n)
            S_1(j)                          (3.7)
            S_2(j)
            ...
            S_p(j)
          END
        ...
      END
    END

According to the definition of DOACROSS [3], the computation in iteration $\bar j = [j_1, \ldots, j_n]^T$ is executed at time step $\Pi\bar j$




3.4.  Relation  to  Wavefront  Scheduling 

The wavefront method [16] (or hyperplane method [6]) is used to execute nested loops on parallel and vector machines when no loop can be parallelized independently of the others in a loop nest. In this method, a set of wavefronts or hyperplanes is drawn across the index set and all iterations on the same wavefront (hyperplane) can be executed in parallel. The only difference between the wavefront method and a feasible generalized selective shrinking with schedule vector $\Pi$ is that the latter executes several ($\mathrm{disp}\,\Pi$) wavefronts simultaneously, exploiting more parallelism in the algorithm. In addition, the method presented in this paper provides a technique to find an optimal wavefront schedule.

4. True Dependence Shrinking

This section informally describes true dependence shrinking and motivates the generalized definition. The relationship between generalized selective shrinking and generalized true dependence shrinking is also discussed. A procedure is proposed to find the optimal loop transformation for generalized true dependence shrinking for 2-dimensional algorithms with constant-bounded index sets.

4.1.  True  dependence  and  generalized  true  dependence 
shrinking 

In [11], for an $n$-dimensional algorithm $(J, D)$, $n > 1$, true dependence shrinking transforms the $n$-dimensional index set into a 1-dimensional index set by coalescing [8], [12] the original nested loops according to their sequential order. Let $u_k$ and $l_k$ be the upper and lower bounds of the $k$th nested loop; then each dependence vector $\bar d = [d_1, \ldots, d_n]^T$ is transformed into the true dependence distance [11], the corresponding dependence vector in the new 1-dimensional index space, $\sum_{i=1}^{n} d_i \prod_{k=i+1}^{n} (u_k - l_k + 1)$. Let $\delta$ be the minimum of all true dependence distances of the algorithm; then the reduction factor^4 is $\delta$, and $\delta$ consecutive index points in the new 1-dimensional index set can be executed in parallel without violating dependence relations of the algorithm.

As an example, for the algorithm in Figure 1, according to the true dependence shrinking described in [11], the 2-dimensional index set in Figure 3a is mapped into a 1-dimensional index set as shown in Figure 3b. Clearly, the index points in Figure 3b are ordered from left to right according to the exact sequential execution order of the nested loops in Figure 1. The true dependence distances of $\bar d_1$ and $\bar d_2$ are $3(u_2 - 4) + 5$ and $2(u_2 - 4) + 4$, respectively. Therefore, the reduction factor is $\delta = 2(u_2 - 4) + 4$ and $2(u_2 - 4) + 4$ consecutive index points in the 1-dimensional index set in Figure 3b can be executed in parallel.

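True dependence shrinking can be described by the mapping $\tau$: $\tau(\bar j) = T(\bar j - \bar v)$, where

$T = \left[\prod_{i=2}^{n} (u_i - l_i + 1), \ldots, \prod_{i=k+1}^{n} (u_i - l_i + 1), \ldots, 1\right]$    (4.1)

with $\prod_{i=k+1}^{n} (u_i - l_i + 1)$ being the $k$th entry of $T$, and $\bar v = [l_1, \ldots, l_n]^T$ is the shift vector. An index point $\bar j \in J$ is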

4 The minimum reduction factor $\delta$ is dependent on the loop bounds, which may not be known until run-time; a run-time test could be generated to determine the minimum $\delta$.


mapped into the new 1-dimensional space as point $T(\bar j - \bar v)$ and the dependence matrix $D$ is mapped as $TD = [T\bar d_1, \ldots, T\bar d_m]$. Therefore, the transformed algorithm can be considered a 1-dimensional algorithm with index set $\tau(J)$ and dependence matrix $TD$. Clearly, no two or more index points $\bar j_1, \bar j_2 \in J$ are mapped into the same point in the 1-dimensional index set since $T\bar j_1 \ne T\bar j_2$ for all $\bar j_1 \ne \bar j_2 \in J$. The reduction factor is $\delta = \min\{T\bar d_i : i = 1, \ldots, m\}$. Usually, $TD > 0$ is required; otherwise the new 1-dimensional algorithm is not computable. In the new 1-dimensional algorithm, $\delta$ consecutive points can be executed in parallel. For the algorithm in Figure 1, the mapping matrix is $T = [u_2 - 4, 1]$ and the shift vector is $\bar v = [3, 5]^T$. An index point $\bar j$ is mapped into the 1-dimensional index set as point $T(\bar j - \bar v)$. The new dependence matrix is $[u_2 - 4, 1]D = [3u_2 - 7, 2u_2 - 4]$ and the reduction factor is $\delta = \min\{3u_2 - 7, 2u_2 - 4\}$.
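The mapping matrix of Equation 4.1 and the resulting true dependence distances are easy to compute for constant loop bounds. The sketch below is an illustration with hypothetical function names; the sizes $u_1 = 10$, $u_2 = 12$ are arbitrary.

```python
def coalescing_vector(lower, upper):
    """Entry k of T is the product of the trip counts of all loops nested
    inside loop k (Equation 4.1); the innermost entry is 1."""
    n = len(lower)
    T = []
    for k in range(n):
        prod = 1
        for i in range(k + 1, n):
            prod *= upper[i] - lower[i] + 1
        T.append(prod)
    return T

def true_dependence_distances(T, deps):
    """Images T d_i of the dependence vectors in the 1-dimensional index set."""
    return [sum(t * x for t, x in zip(T, d)) for d in deps]

def tau(T, shift, j):
    """Position T (j - v) of index point j in the 1-dimensional index set."""
    return sum(t * (x - s) for t, x, s in zip(T, j, shift))

u1, u2 = 10, 12
lower, upper = [3, 5], [u1, u2]
T = coalescing_vector(lower, upper)
distances = true_dependence_distances(T, [(3, 5), (2, 4)])
delta = min(distances)
```

With these sizes the two distances are $3u_2 - 7$ and $2u_2 - 4$, and the smaller one is the reduction factor.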

Again, some interesting observations can be made from the example above. Consider the algorithm in Figure 5 with a dependence vector $\bar d = [0, 1]^T$. By true dependence shrinking in [11], the corresponding true dependence distance is $T\bar d = 1$ where $T = [u_2 + 1, 1]$ is as defined in Equation 4.1 and the reduction factor is $\delta = \min\{T\bar d_i : i = 1, \ldots, m\} = 1$. Thus, the new algorithm $(\tau(J), TD)$ can only be executed sequentially and no speedup is possible. However, it is clear that if another mapping $\tau$, specified by a matrix $T$ different from the one defined in Equation 4.1, is used, it is possible to obtain speedup and exploit the maximal degree of parallelism.

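The total execution time by the true dependence shrinking specified by $T$ is

$t = \left\lceil \dfrac{\max\{T(\bar j_1 - \bar j_2) : \bar j_1, \bar j_2 \in J\} + 1}{\min\{T\bar d_i : i = 1, \ldots, m\}} \right\rceil = \left\lfloor \dfrac{\max\{T(\bar j_1 - \bar j_2) : \bar j_1, \bar j_2 \in J\}}{\min\{T\bar d_i : i = 1, \ldots, m\}} \right\rfloor + 1$    (4.2)

Instead of the mapping matrix $T$ in Equation 4.1, $T$ should be chosen such that the total execution time $t$ is minimized. This is illustrated by the example in Figure 6a where $J = \{[j_1, j_2]^T : 0 \le j_1, j_2 \le u\}$ and $D = [\bar d_1, \bar d_2] = \begin{bmatrix} 2 & 0 \\ 0 & 2 \end{bmatrix}$.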

If the mapping matrix $T = [1 + u, 1]$ in Equation 4.1 is applied, then the total execution time is $\left\lceil \dfrac{[1 + u, 1][u, u]^T + 1}{2} \right\rceil = \left\lceil \dfrac{u(u + 2) + 1}{2} \right\rceil$. Consider another mapping matrix $T' = [1 + u, u]$. Clearly $T'D = [2(1 + u), 2u] > 0$ and it can be shown that no two or more index points are mapped into the same image in the new index set (this will become clear later in Section 4.2). Hence, this mapping is feasible. The new 1-dimensional index set is shown in Figure 6b. The total execution time is $t(T') = \left\lceil \dfrac{u(2u + 1) + 1}{2u} \right\rceil$, which is clearly much shorter than that of the mapping matrix $[1 + u, 1]$ when $u > 1$. There may be several choices for mapping matrix $T$ which have shorter execution



(a) The index set before optimal generalized true dependence shrinking.

00 01 10 02 11 20 03 12 21 30 04 13 22 31 40

(b) New 1-dimensional index set after optimal true dependence shrinking. The point with label $ab$ is the image of point $[a, b]^T$ in (a).

Figure 6: Illustration of optimal true dependence shrinking.


time than the mapping matrix defined in Equation (4.1). In other words, index points should be allowed to project along not only the $n$th dimension but also other directions. These observations motivate the definition of the generalized true dependence shrinking as follows.

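Definition 4.1 (generalized true dependence shrinking): Generalized true dependence shrinking is a mapping $\tau : J \to \tau(J)$, $\tau(\bar j) = T(\bar j - \bar v)$, $\forall \bar j \in J$, where $T = [t_1, t_2, \ldots, t_n] \in \mathbb{Z}^{1 \times n}$ is the mapping matrix, and $\bar v = [l_1, \ldots, l_n]^T$ is the shift vector. Mapping $\tau$ must satisfy the following three conditions:

(1) $\forall \bar j_1, \bar j_2 \in J$, $\bar j_1 \ne \bar j_2$: $T\bar j_1 \ne T\bar j_2$,

(2) $TD > 0$, and

(3) $\gcd(t_1, t_2, \ldots, t_n) = 1$.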

True dependence shrinking in [11] is a special case of generalized true dependence shrinking with $T$ as defined in Equation 4.1, which may not be optimal for some algorithms. Condition 1 in the above definition guarantees that no two or more index points in $J$ are mapped into the same point in the new 1-dimensional index set $\tau(J)$. For any $\bar j_1, \bar j_2 \in J$, $\bar j_1 \ne \bar j_2$, if $\tau(\bar j_1) = \tau(\bar j_2)$ then the mapping $\tau$ is said to have conflicts. Otherwise, mapping $\tau$ is conflict-free. Condition 2 guarantees that all dependence vectors are transformed into positive true dependence distances and the new algorithm is executable. Because $t(aT) = t(T)$ where $a \ne 0$ is an integer constant, condition 3 avoids repeated representations of solutions. Hence, to find optimal generalized true dependence shrinking is to find an integral row vector $T$ which minimizes the total execution time and satisfies certain constraints:


min 


maxtHT-Ja):  ~L,  ~;S/) 
min{T4,-:  *— l, ...,  m] 


(4.3) 


(1)  VTl.22  £/.  t/i  rdTJ2 

subject  to  |  (2)  TD>Q 

(3)  ged  (»x,...,f.)-l 

Notice  that  the  total  execution  time  r-|/]+l.  So  t  is 
minimized  iff" f  is  minimized. 
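The three conditions of Definition 4.1 and the execution time in (4.3) can be checked by brute force on a small index set. The sketch below is illustrative only: the function names are hypothetical, the index set is assumed to be the box $0 \le j_i \le \mu$ of Figure 6 with $\mu = 4$, and the dependence matrix $D = 2I$ is an assumption inferred from $T'D = [2(1+\mu),\ 2\mu]$.

```python
from functools import reduce
from itertools import product
from math import gcd


def index_points(bounds):
    """All integer points j with 0 <= j_i <= bounds[i] (assumed box index set)."""
    return list(product(*(range(b + 1) for b in bounds)))


def image(T, j):
    """The 1-dimensional image T*j of a point (or vector) j."""
    return sum(t * x for t, x in zip(T, j))


def is_feasible_mapping(T, bounds, D):
    """Brute-force check of conditions (1)-(3) of Definition 4.1.

    D is given as a list of dependence vectors (the columns of matrix D).
    """
    images = [image(T, j) for j in index_points(bounds)]
    conflict_free = len(set(images)) == len(images)       # condition (1)
    positive = all(image(T, d) > 0 for d in D)            # condition (2)
    coprime = reduce(gcd, (abs(t) for t in T)) == 1       # condition (3)
    return conflict_free and positive and coprime


def execution_time(T, bounds, D):
    """t = floor(f) + 1, with f = max T(j1 - j2) / min T d_i as in (4.3)."""
    images = [image(T, j) for j in index_points(bounds)]
    span = max(images) - min(images)          # max of T(j1 - j2) over the index set
    step = min(image(T, d) for d in D)        # smallest transformed dependence
    return span // step + 1


mu = 4
bounds = (mu, mu)
D = [(2, 0), (0, 2)]                          # assumed dependence vectors

for T in [(1, 1), (1 + mu, 1), (1 + mu, mu)]:
    print(T, is_feasible_mapping(T, bounds, D), execution_time(T, bounds, D))
```

Under these assumptions, $[1, 1]$ fails condition (1), while $[1+\mu,\ 1]$ and $[1+\mu,\ \mu]$ both pass, and the latter attains the same execution time as the conflicting schedule $[1, 1]$.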


4.2. Conflict-free mappings

The concept of conflict-free mapping is well defined and
explained in [15]. Necessary and sufficient conditions for
mapping matrix $T \in Z^{k \times n}$, $k < n$, to have $T\bar{j}_1 \neq T\bar{j}_2$, $\forall \bar{j}_1, \bar{j}_2 \in J$,
$\bar{j}_1 \neq \bar{j}_2$, are also given in [15]. For clarity, it is necessary to
review one definition and one theorem in [15].

Let $T \in Z^{1 \times n}$ be the mapping matrix from an
$n$-dimensional index set $J$ to a 1-dimensional index set $\tau(J)$.
A conflict occurs if two or more index points are mapped
into the same point in $\tau(J)$. That is, for two distinct index
points $\bar{j}_1, \bar{j}_2 \in J$, if $T\bar{j}_1 = T\bar{j}_2$, then there is a conflict.
Consider a non-zero vector $\bar{\gamma}$ such that $T\bar{\gamma} = 0$. Let $\bar{j}_2 = \bar{j}_1 + \bar{\gamma}$;
then $T\bar{j}_1 = T\bar{j}_2$. If both $\bar{j}_1$ and $\bar{j}_2$ belong to the index set,
then $\bar{j}_1$ and $\bar{j}_2$ are mapped to the same point in the
1-dimensional index set $\tau(J)$ and a conflict occurs. One
possible way to avoid conflicts is to find the mapping matrix
$T$ such that, for any arbitrary index point $\bar{j} \in J$ and any $\bar{\gamma}$ that
is a non-zero integral solution of equation $T\bar{\gamma} = 0$, $\bar{j} + \bar{\gamma}$ does
not belong to the index set $J$. This concept is illustrated by
Figure 7, which shows a 2-dimensional index set $J = \{[j_1, j_2]^T:\ 0 \le j_1, j_2 \le 4\}$.
If $\bar{\gamma}$ is $\bar{\gamma}_1 = [1, 1]^T$, then index points
$\bar{j} = \bar{0}$ and $\bar{j} + \bar{\gamma}_1 = [1, 1]^T$ both belong to index set $J$, and index
points $[0, 0]^T$, $[1, 1]^T$, $[2, 2]^T$, ..., $[4, 4]^T$ will be mapped into
the same point in $\tau(J)$. Therefore, there is at least one
conflict. However, if $\bar{\gamma}$ is $\bar{\gamma}_2 = [3, 5]^T$, there will be no conflict
at all because for any arbitrary $\bar{j} \in J$, $\bar{j} + \bar{\gamma}_2 \notin J$. Intuitively, if
vector $[3, 5]^T$ is drawn with one end at $[0, 0]^T$ (or at any
other index point of the index set), then the other end is out
of the index set and vector $[3, 5]^T$ does not meet any integer
points in the index set. Therefore, the mapping with this $\bar{\gamma}$ is
conflict-free. To describe these concepts formally, the
following definitions are introduced.


Figure 7: Non-feasible conflict vector $\bar{\gamma}_1$ and feasible
conflict vector $\bar{\gamma}_2$. Vector $\bar{\gamma}_2$ does not meet any
integral points inside the index set.


Definition 4.2 (Conflict vector, feasible and non-feasible
conflict vectors and conflict-free mapping
matrix): Given an algorithm $(J, D)$ and a mapping matrix
$T \in Z^{1 \times n}$, an integral column vector $\bar{\gamma} = [\gamma_1, \ldots, \gamma_n]^T$ is a
conflict vector of the mapping matrix $T$ if and only if $T\bar{\gamma} = 0$
and $\gcd(\gamma_1, \ldots, \gamma_n) = 1$. If for any arbitrary index point $\bar{j} \in J$,
$\bar{j} + \bar{\gamma} \notin J$, then $\bar{\gamma}$ is a feasible conflict vector. If there exists at
least one index point $\bar{j} \in J$ such that $\bar{j} + \bar{\gamma} \in J$, then $\bar{\gamma}$ is called a
non-feasible conflict vector. If all the conflict vectors are
feasible, then $T$ is conflict-free.

For  algorithms  with  constant-bounded  index  sets,  the 
common  characteristics  of  feasible  conflict  vectors  are 
described  in  the  following  theorem. 

Theorem 4.1: For algorithms with constant-bounded index
sets defined by Equation 3.4, a mapping matrix $T$ is conflict-free
if and only if for each of its conflict vectors $\bar{\gamma} = [\gamma_1, \ldots, \gamma_i, \ldots, \gamma_n]^T$
there exists an entry $\gamma_i$ such that $|\gamma_i| > \mu_i$.

Proof: ($\Rightarrow$). Without loss of generality, let $l_i = 0$, $i = 1, \ldots, n$.
Because $T$ is conflict-free, all the conflict vectors of $T$ are
feasible. Now suppose that $\bar{\gamma}$ is a conflict vector of $T$ and
$|\gamma_i| \le \mu_i$, $i = 1, \ldots, n$. Consider the index point $\bar{j} = [j_1, \ldots, j_n]^T$
where $j_i = 0$ if $\gamma_i \ge 0$ and $j_i = -\gamma_i$ if $\gamma_i < 0$. It is clear that both
$\bar{j}$ and $\bar{j} + \bar{\gamma}$ belong to the index set $J$ defined by Equation 3.4
because $|\gamma_i| \le \mu_i$, $i = 1, \ldots, n$. By Definition 4.2, $\bar{\gamma}$ is not
feasible, which is contrary to the assumption. Therefore, for
each of the conflict vectors $\bar{\gamma}$, there must exist an entry $\gamma_i$
such that $|\gamma_i| > \mu_i$.

($\Leftarrow$). Let $\bar{\gamma}$ be a conflict vector of mapping matrix $T$ and
consider an arbitrary index point $\bar{j}$ belonging to the index set
defined by Equation 3.4. Let $\bar{j}' = \bar{j} + \bar{\gamma} = [j_1 + \gamma_1, \ldots, j_n + \gamma_n]^T$.
Because there exists an entry $\gamma_i$ of $\bar{\gamma}$ such that $|\gamma_i| > \mu_i$ and
$0 \le j_i \le \mu_i$, $j_i + \gamma_i > \mu_i$ if $\gamma_i > 0$ and $j_i + \gamma_i < 0$ if $\gamma_i < 0$. In
both cases, $\bar{j}'$ is not in the index set $J$ and $\bar{\gamma}$ is feasible. This
implies that $T$ is conflict-free. □

Example 4.1: Consider a 3-dimensional index set $J = \{\bar{j}:\ 0 \le j_i \le \mu,\ i = 1, 2, 3,\ \bar{j} \in Z^3\}$,
mapping matrix $T = [\mu+1,\ \mu,\ 1]$
and the following solutions of $T\bar{\gamma} = 0$: $\bar{\gamma}_1 = [-\mu,\ \mu+1,\ 0]^T$ and
$\bar{\gamma}_2 = [1, -1, -1]^T$. Clearly, $T\bar{\gamma}_1 = T\bar{\gamma}_2 = 0$ and the greatest
common divisors of their entries are unity. So $\bar{\gamma}_1$ and $\bar{\gamma}_2$ are
conflict vectors of mapping matrix $T$. However, vector
$[2, -2, -2]^T$ is also a solution of equation $T\bar{\gamma} = 0$ but is not a
conflict vector of mapping matrix $T$ because the greatest
common divisor of its entries is not unity. Conflict vector $\bar{\gamma}_1$
is feasible because it can be checked that for any arbitrary
index point $\bar{j} \in J$, $\bar{j} + \bar{\gamma}_1 \notin J$. Conflict vector $\bar{\gamma}_2$ is not feasible
because for the index point $\bar{j} = [0, 1, 1]^T \in J$, $\bar{j} + \bar{\gamma}_2 = [1, 0, 0]^T \in J$.
Therefore, $T$ is not conflict-free. □
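The feasibility test of Theorem 4.1 can be compared against a direct search over the index set. The following is a minimal sketch, not the authors' procedure: the function names are hypothetical, the index set is assumed to be the box $0 \le j_i \le \mu_i$, and $\mu = 3$ is an arbitrary choice used to instantiate Example 4.1.

```python
from functools import reduce
from itertools import product
from math import gcd


def is_conflict_vector(T, g):
    """Definition 4.2: T*g = 0 and the entries of g are relatively prime."""
    orthogonal = sum(t * x for t, x in zip(T, g)) == 0
    return orthogonal and reduce(gcd, (abs(x) for x in g)) == 1


def feasible_by_theorem(g, bounds):
    """Theorem 4.1 test: some entry satisfies |g_i| > mu_i."""
    return any(abs(x) > b for x, b in zip(g, bounds))


def feasible_by_search(g, bounds):
    """Direct check: no index point j has j + g inside the box index set."""
    for j in product(*(range(b + 1) for b in bounds)):
        if all(0 <= a + x <= b for a, x, b in zip(j, g, bounds)):
            return False
    return True


mu = 3                                   # assumed value, for illustration only
bounds = (mu, mu, mu)
T = (mu + 1, mu, 1)
gamma1 = (-mu, mu + 1, 0)
gamma2 = (1, -1, -1)

for g in (gamma1, gamma2, (2, -2, -2)):
    print(g, is_conflict_vector(T, g),
          feasible_by_theorem(g, bounds), feasible_by_search(g, bounds))
```

Both tests agree on the two conflict vectors of the example; the vector $[2, -2, -2]^T$ is rejected earlier because its entries are not relatively prime.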


4.3. Finding optimal generalized shrinking for 2-dimensional algorithms

Notice that if constraint 1 in (4.3) is removed, this
optimization problem is exactly the same as the optimal
generalized selective shrinking problem defined in (3.3).
Therefore, our approach to find the optimal solution to (4.3)
is as follows. First, the optimal solution $T^* = \Pi^*$ to (3.3)
without consideration of conflicts is obtained by Procedure
3.1. Without any doubt, this $T^*$ is not conflict-free. Then, as
illustrated in Figure 6, a mapping matrix $T'$ is selected from
a small neighborhood of $T^*$ such that $T'$ is conflict-free and
$t(T') = t(T^*)$. Because $T^*$ is the optimal solution without
considering conflicts, $t(T^*)$ is a lower bound on the total
execution time of any solutions to (4.3); hence, $T'$ must be
optimal. This idea is illustrated by the example in Figure 6a
as follows.

Consider the algorithm in Figure 6a again. By Procedure
3.1, the optimal linear schedule is $\Pi^* = [1, 1]$. Let $T^* = \Pi^*$
and $T' = [1+\mu,\ \mu]$. Clearly, when $\mu$ is large enough, $T'$ is in
a small neighborhood of $T^*$. All conflict vectors of mapping
matrix $T'$ can be expressed as $\bar{\gamma} = \lambda[-\mu,\ 1+\mu]^T$ where $\lambda = \pm 1$.
Clearly, these conflict vectors are feasible by Theorem 4.1.
The corresponding $f(T^*)$, $t(T^*)$, $f(T')$ and $t(T')$, defined in
(4.3) and (4.2), are as follows, respectively:

$$f(T^*) = \frac{[1, 1]\,[\mu, \mu]^T}{2} = \mu$$

$$t(T^*) = \left\lceil \frac{[1, 1]\,[\mu, \mu]^T + 1}{2} \right\rceil = \left\lceil \frac{2\mu+1}{2} \right\rceil = \left\lceil \mu + \frac{1}{2} \right\rceil = \mu + 1$$

$$f(T') = \frac{[1+\mu,\ \mu]\,[\mu, \mu]^T}{2\mu} = \frac{\mu(2\mu+1)}{2\mu} = \mu + \frac{1}{2}$$

$$t(T') = \left\lceil \frac{[1+\mu,\ \mu]\,[\mu, \mu]^T + 1}{2\mu} \right\rceil = \left\lceil \frac{\mu(2\mu+1)+1}{2\mu} \right\rceil = \left\lceil \mu + \frac{\mu+1}{2\mu} \right\rceil = \mu + 1$$

Clearly, $t(T') = t(T^*)$, so this mapping is optimal.
Intuitively, $T'$ is in a very small neighborhood of $T^*$. By
rotating $T^*$ a little bit, all points can be projected along $T'$
such that no two or more points are projected into the same
position on $T'$. Because $T'$ and $T^*$ are so close, the difference
between $t(T')$ and $t(T^*)$ is zero because of the ceiling
function, as defined in (4.2).

In the following, finding an optimal generalized true
dependence shrinking for 2-dimensional algorithms with
constant-bounded index sets is discussed by using the idea
mentioned above. How to find optimal solutions for
algorithms with any convex polyhedron index sets is under
investigation.

Lemma 1: Let $T = [t_1, t_2, \ldots, t_n]$ and $\gcd(t_1, t_2, \ldots, t_n) = 1$.
Then there exist a positive integer $w$ and an index $i \in \{1, \ldots, n\}$
such that $\gcd(t_1 w, \ldots, t_i w + 1, \ldots, t_n w) = 1$.

Proof: Because $\gcd(t_1, t_2, \ldots, t_n) = 1$, $T$ has at least one
nonzero entry and there exist two elements $t_k$ and $t_l$ such
that $\gcd(t_k, t_l) = 1$. Without loss of generality, let $k = 1$, $l = 2$
and $t_1 \neq 0$. Let $w = |t_1| w'$ where $w'$ is an arbitrary positive
integer. Consider the pair $(t_1 w,\ t_2 w + 1)$; it is shown
next that $\gcd(t_1 w,\ t_2 w + 1) = 1$. Suppose $\gcd(t_1 w,\ t_2 w + 1) = \beta$. Then

$$t_1 w = \beta a_1, \quad \text{or} \quad t_1 |t_1| w' = \beta a_1 \qquad (4.4)$$

and

$$t_2 w + 1 = \beta a_2, \quad \text{or} \quad -t_2 w + \beta a_2 = 1 \qquad (4.5)$$

According to [7], equation $ax + by = 1$ has integral solution iff
$\gcd(a, b) = 1$ and $\gcd(x, y) = 1$. Therefore, because of
Equation 4.5, $\gcd(w, \beta) = 1$. By Equation 4.4, $\beta$ consists of
factors of either $t_1$ or $w'$ or both. In both cases, $\beta$ is a factor
of $w$. Therefore, it must have $\beta = 1$; otherwise, $\gcd(w, \beta) > 1$.
Hence, $\gcd(t_1 w,\ t_2 w + 1) = \beta = 1$, which implies
$\gcd(t_1 w,\ t_2 w + 1,\ t_3 w, \ldots, t_n w) = 1$. □

Lemma 2: Consider a 2-dimensional algorithm with $J = \{\bar{j}:\ 0 \le j_i \le \mu_i,\ j_i \in Z,\ i = 1, 2\}$.
Let $T = [t_1, t_2]$ and $\gcd(t_1, t_2) = 1$.
Without loss of generality, let $t_1 \neq 0$. Let $w$ be a positive
integer such that (1) $w = |t_1| w'$ for a positive integer $w'$ and
(2) $w|t_1| > \mu_2$. Then mapping matrix $T' = [t_1 w,\ t_2 w + 1]$ is
conflict-free.

Proof: Let $\bar{\gamma} = [\gamma_1, \gamma_2]^T$ and consider the following equation

$$T'\bar{\gamma} = 0. \qquad (4.6)$$

Clearly, all conflict vectors of $T'$ can be expressed as follows:

$$\begin{bmatrix} \gamma_1 \\ \gamma_2 \end{bmatrix} = \lambda \begin{bmatrix} t_2 w + 1 \\ -t_1 w \end{bmatrix} \qquad (4.7)$$

where $\lambda$ is a constant such that $\bar{\gamma}$ is integral and $\gcd(\gamma_1, \gamma_2) = 1$.
By Lemma 1, $\gcd(t_2 w + 1,\ -t_1 w) = 1$. Hence, $\lambda$ must be
either $-1$ or $1$ to have the above two conditions satisfied.
Because $|\gamma_2| = |t_1| w > \mu_2$ for any value of $\lambda$, $\bar{\gamma}$ is a feasible
conflict vector by Theorem 4.1. So $T'$ is conflict-free. □
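The construction in Lemmas 1 and 2 is easy to try numerically. The sketch below is an illustration under stated assumptions rather than the paper's procedure: the helper names are hypothetical, the index set is the box $0 \le j_i \le \mu_i$, and $w$ is taken as the smallest multiple of $|t_1| w'$ steps that satisfies $w|t_1| > \mu_2$.

```python
from itertools import product
from math import gcd


def perturbed_mapping(t1, t2, mu2, w_prime=1):
    """Return (w, T') with T' = [t1*w, t2*w + 1], following Lemma 2.

    w starts at |t1| * w_prime and is increased in steps of |t1| until
    w * |t1| > mu2, so that w stays a multiple of |t1|.
    """
    if t1 == 0:
        raise ValueError("Lemma 2 assumes t1 != 0")
    w = abs(t1) * w_prime
    while w * abs(t1) <= mu2:
        w += abs(t1)
    return w, (t1 * w, t2 * w + 1)


def has_conflict(T, bounds):
    """True if two distinct points of the box index set share an image."""
    images = [sum(t * x for t, x in zip(T, j))
              for j in product(*(range(b + 1) for b in bounds))]
    return len(set(images)) != len(images)


for (t1, t2), bounds in [((1, 1), (4, 4)), ((2, 3), (5, 7))]:
    w, t_prime = perturbed_mapping(t1, t2, bounds[1])
    print((t1, t2), "w =", w, "T' =", t_prime,
          "gcd =", gcd(*t_prime), "conflict:", has_conflict(t_prime, bounds))
```

For the two sample inputs the perturbed matrices are $[5, 6]$ and $[8, 13]$; both have relatively prime entries and produce no conflicts on their index sets.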


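Theorem 4.2: Consider a 2-dimensional algorithm $(J, D)$
with index set $J = \{\bar{j}:\ 0 \le j_i \le \mu_i,\ j_i \in Z,\ i = 1, 2\}$. Let $T^* = [t_1, t_2]$
be the optimal solution to (3.3) found by Procedure
3.1. Without loss of generality, let $t_1 \neq 0$. Then $T' = [t_1 w,\ t_2 w + 1]$
is the optimal solution to problem (4.3) where $w$ is a
positive integer such that (1) $w$ is a multiple of $|t_1|$, (2)
$w > \mu_2 / |t_1|$ and (3) $w$ is reasonably large.

Proof: By Lemmas 1 and 2, $T'$ is conflict-free and the
greatest common divisor of the entries of $T'$ is unity. Let $\bar{d}_k$
be such that $\min\{T^*\bar{d}_i:\ i = 1, \ldots, m\} = T^*\bar{d}_k$ and $\bar{p}$ be such that
$\max\{T^*(\bar{j}_1 - \bar{j}_2):\ \bar{j}_1, \bar{j}_2 \in J\} = T^*\bar{p}$. Then

$$f(T^*) = \frac{\max\{T^*(\bar{j}_1 - \bar{j}_2):\ \bar{j}_1, \bar{j}_2 \in J\}}{\min\{T^*\bar{d}_i:\ i = 1, \ldots, m\}} = \frac{T^*\bar{p}}{T^*\bar{d}_k}.$$

Because $T'$ is in a very small neighborhood of $T^*$ and $w$ is
reasonably large, $\min\{T'\bar{d}_i:\ i = 1, \ldots, m\} = T'\bar{d}_k$ and
$\max\{T'(\bar{j}_1 - \bar{j}_2):\ \bar{j}_1, \bar{j}_2 \in J\} = T'\bar{p}$ when $w$ is sufficiently large. Hence

$$f(T') = \frac{\max\{T'(\bar{j}_1 - \bar{j}_2):\ \bar{j}_1, \bar{j}_2 \in J\}}{\min\{T'\bar{d}_i:\ i = 1, \ldots, m\}} = \frac{T'\bar{p}}{T'\bar{d}_k}.$$

$T'\bar{d}_k = (wT^* + [0, 1])\bar{d}_k = wT^*\bar{d}_k + d_{2k}$, and $T'\bar{p} = (wT^* + [0, 1])\bar{p} = wT^*\bar{p} + p_2$. So

$$f(T') = \frac{wT^*\bar{p} + p_2}{wT^*\bar{d}_k + d_{2k}}$$

and

$$f(T') = f(T^*) + \frac{p_2\,T^*\bar{d}_k - d_{2k}\,T^*\bar{p}}{(wT^*\bar{d}_k + d_{2k})\,T^*\bar{d}_k}.$$

Therefore, when $w$ is large enough, $\lfloor f(T') \rfloor = \lfloor f(T^*) \rfloor$ and
$t(T') = t(T^*)$. □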
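The closed form for $f(T') - f(T^*)$ in the proof can be checked with exact rational arithmetic. This is a small sketch under assumptions that are not part of the paper: $T^* = [1, 1]$, the box index set with $\mu_1 = \mu_2 = 4$, dependence vectors $(2, 0)$ and $(0, 2)$, nonnegative entries in $T$ (so that the maximizing difference $\bar{p}$ is the vector of upper bounds), and $\bar{d}_k$ chosen as the minimizer for $T'$.

```python
from fractions import Fraction
from math import floor


def dot(T, v):
    return sum(t * x for t, x in zip(T, v))


def f_value(T, bounds, D):
    """f = max T(j1 - j2) / min T d_i; assumes every entry of T is >= 0."""
    return Fraction(dot(T, bounds), min(dot(T, d) for d in D))


t_star = (1, 1)                 # assumed optimal linear schedule
bounds = (4, 4)                 # assumed upper bounds mu_1, mu_2
D = [(2, 0), (0, 2)]            # assumed dependence vectors
f_star = f_value(t_star, bounds, D)

rows = []
for w in (5, 10, 50):           # multiples of |t1| with w * |t1| > mu_2
    t_prime = (t_star[0] * w, t_star[1] * w + 1)
    d_k = min(D, key=lambda d: dot(t_prime, d))
    p = bounds                  # maximizing difference for nonnegative T
    correction = Fraction(
        p[1] * dot(t_star, d_k) - d_k[1] * dot(t_star, p),
        (w * dot(t_star, d_k) + d_k[1]) * dot(t_star, d_k),
    )
    f_prime = f_value(t_prime, bounds, D)
    rows.append((w, f_prime, f_star + correction))
    print(w, f_prime, f_star + correction, floor(f_prime) + 1, floor(f_star) + 1)
```

In this instance the correction term shrinks as $w$ grows, and the floor of $f(T')$ already matches the floor of $f(T^*)$ at the smallest admissible $w$.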

Notice that from (3.3) and (4.3), the total execution
time by an optimal generalized true dependence shrinking
cannot be shorter than that of optimal generalized selective
shrinking. In the selective shrinking, the original index set
can be thought of as being transformed into a 1-dimensional
index set by the mapping $\Pi$. A single point in the new
1-dimensional index set is the image of multiple points in the
original index set which can be executed in parallel without
violating dependence relations. Optimal true dependence
shrinking is basically the same mapping as the selective
shrinking except that it requires a one-to-one mapping. In
some cases, the requirement of one-to-one mapping is not
necessary if the purpose is to identify all computations which
can be executed in parallel.


5.  Conclusions 

In this paper, two loop transformation techniques
proposed in [11] are generalized in order to explore more
parallelism of algorithms with nested loops. Methods to find
optimal transformations are discussed. For the generalized
selective shrinking, the method finds time-optimal
transformations for algorithms with any convex polyhedron
index sets. For the generalized true dependence shrinking,
the method finds optimal solutions for 2-dimensional
algorithms with constant-bounded index sets. How the two
transformation techniques are related is also briefly discussed.

References: 

[1] M.S. Bazaraa and C.M. Shetty, Nonlinear Programming:
Theory and Algorithms, John Wiley & Sons, 1979.

[2] M. Byler, J.R.B. Davies, C. Huson, B. Leasure, M.
Wolfe, "Multiple Version Loops," Proc. Int. Conf. on
Parallel Processing, pp. 312-318, August 1987, St.
Charles, IL.

[3] R. Cytron, "Doacross: Beyond Vectorization for
Multiprocessors," Proc. Int. Conf. on Parallel Processing,
pp. 836-844, August 1986, St. Charles, IL.

[4] J.A.B. Fortes and B.W. Wah, "Systolic Array - From
Concept to Implementation," IEEE Computer, July
1987, pp. 12-17.

[5] R.M. Karp, R.E. Miller and S. Winograd, "The
Organization of Computations for Uniform Recurrence
Equations," Journal of the ACM, Vol. 14, No. 3, pp.
563-590, July 1967.

[6] L. Lamport, "The Parallel Execution of DO Loops,"
Comm. of the ACM, Vol. 17, No. 2, Feb. 1974, pp. 83-93.

[7] L.J. Mordell, Diophantine Equations, Academic Press,
New York, 1969, pp. 30.

[8] M.T. O'Keefe and H.G. Dietz, "Loop Coalescing and
Scheduling for Barrier MIMD Architectures," submitted
June 1990 to IEEE Trans. on Parallel and Distributed
Systems.

[10] J.-K. Peir and R. Cytron, "Minimum Distance: A
Method for Partitioning Recurrences for
Multiprocessors," IEEE Trans. on Computers, Vol. 38,
No. 8, pp. 1203-1211, August 1989.

[11] C.D. Polychronopoulos, "Compiler Optimizations for
Enhancing Parallelism and their Impact on Architecture
Design," IEEE Trans. on Computers, Vol. 37, No. 8,
August 1988, pp. 991-1004.

[12] C.D. Polychronopoulos, Parallel Programming and
Compilers. Boston: Kluwer Academic Publishers, 1988.

[13] W. Shang and J.A.B. Fortes, "Time Optimal Linear
Schedules for Algorithms with Uniform Dependencies,"
Proceedings of Int. Conf. on Systolic Arrays, May 1988,
pp. 393-402 (also to appear in IEEE Trans. on
Computers).

[14] W. Shang and J.A.B. Fortes, "On Optimality of Linear
Schedules," Journal of VLSI Signal Processing, 1, pp.
209-220 (1989), Kluwer Academic Publishers, Boston.

[15] W. Shang and J.A.B. Fortes, "Time-Optimal and
Conflict-Free Mappings of Uniform Dependence
Algorithms into Lower Dimensional Processor Arrays,"
Proc. of Int'l Conf. on Parallel Processing, August 1990,
pp. 101-110 (I) (also to appear in IEEE Trans. on
Parallel and Distributed Systems).

[16] M. Wolfe, "Loop Skewing: The Wavefront Method
Revisited," Int'l J. of Parallel Programming, Vol. 15,
No. 4, 1986, pp. 279-293.

[17] M.J. Wolfe, Optimizing Supercompilers for
Supercomputers, Cambridge, MA: MIT Press, 1989.




REFERENCE  NO.  17 


Cam, H. and Fortes, J. A. B., "Fault-Tolerant Self-Routing Permutation Networks,"
1992 International Conference on Parallel Processing, August 17-21, St. Charles,
Illinois, Vol. I, pp. 243-247.

Note - this paper proposes a novel fault-tolerant interconnection network capable of
implementing any permutation using distributed destination-tag based self-routing of
messages. The time needed to implement any permutation is the smallest of any
network with comparable hardware complexity.


1992 International Conference on Parallel Processing

FAULT-TOLERANT SELF-ROUTING PERMUTATION NETWORKS*


Hasan Cam and Jose A.B. Fortes
School of Electrical Engineering
Purdue University, West Lafayette, IN 47907 (USA)


Abstract: Four different self-routing permutation
networks are presented in this paper. The
first self-routing permutation network is constructed
from concentrators and digit-controlled 2x4
switches. This self-routing permutation network has
$O(\log^2 N)$ gate-level delay with a very small constant
and uses $O(N^2 \log N)$ VLSI-level hardware.¹ The
propagation time of this network is less than that of
any self-routing permutation network presented to
date. The second self-routing permutation network
is obtained from the first self-routing permutation
network by replacing 2x4 switches at some stages by
short-circuit switches and removing the concentrators
at these stages. This network is proposed to
avoid having very large concentrators when $N$ gets
large. The third self-routing permutation network
consists of a $\log N$-stage network followed by one
concentrator whose outputs are connected to the inputs
of the network. Finally, a 1-fault-tolerant self-routing
permutation network is obtained by adding
intrastage links, multiplexers and demultiplexers to
the first self-routing permutation network.

1.  INTRODUCTION 

An interconnection network (IN) with $N = 2^n$
inputs/outputs is called a permutation network (or
rearrangeable network) if it realizes every one of the
$N!$ permutations in a single pass. This paper
presents four different self-routing permutation networks
constructed from concentrator(s) and
different digit-controlled switches. Each of these networks
realizes a given permutation using the destination
tag routing scheme. Three of these networks
are based on a new approach which does not allow
any conflict to occur in switches. Because sorting
is the rearrangement of items into ascending or descending
order [2], these networks can also be used as
sorting networks.

Sorting networks are often constructed using
two-input, two-output comparators that send the
smaller of their two inputs to their upper output. A
comparator compares its two $\log N$-bit inputs to
determine its state. Because a comparator can be
replaced by $\log N$ 1-bit wide comparators, the gate-level
(also referred to as bit-level) delay of sorting networks
with $N$ inputs/outputs can be determined by
multiplying their comparator-level delay by $\log N$
[1,2]. Although the delays of the self-routing permutation
networks presented in this paper are stated in
gate-level, the area complexities of our self-routing
permutation networks are expressed in VLSI-level
instead of gate-level because the concentrators used
in these networks contain gates whose sizes vary
with the number of inputs.

* Research supported in part by the Office of Naval
Research under contract No. 00014-90-J-1483 and in part
by the Innovative Science and Technology Office of the
Strategic Defense Initiative Organization and administered
through the Office of Naval Research under contract No.
00014-88-k-0723.

¹ All logarithms are in base 2 unless stated otherwise.


Batcher [3] presented both the bitonic network
and the odd-even network, both of which take
$O(\log^2 N)$ comparator-level time to sort input vectors
of size $N$. Ajtai et al. [4] introduced a network
of size $O(N \log N)$ that can sort in $O(\log N)$ steps,
which is asymptotically superior to the existing sorting
networks, but not practical because of its large
constant. Koppelman and Oruc [1] described a self-routing
permutation network which has $O(\log^3 N)$
gate-level delay and uses $O(N \log^3 N)$ gate-level
hardware. Jan and Oruc [9] presented two self-routing
permutation networks, one of which has
$O(N \log^2 N)$ gate-level cost and $O(\log^3 N)$ gate-level
delay. If VLSI-area (instead of simply gate-level
area) is considered, then Jan and Oruc's network
has $O(N^2 \log N)$ VLSI-level cost as a consequence of
the results in [10,11] which show that cube-type networks
use $O(N^2)$ layout-area. One of the self-routing
permutation networks presented in this
paper has $O(\log^2 N)$ gate-level delay and
$O(N^2 \log N)$ VLSI-level cost.

This paper is organized as follows. Section 2
briefly explains radix sorting and the destination
tag routing scheme and points out the close relation
between them. In Section 3, the configuration and
performance analysis of a self-routing permutation
network are described. Section 4 presents three
additional self-routing permutation networks which
are obtained by modifying the self-routing permutation
network presented in Section 3. Section 5 is
dedicated to the conclusions.


2.  RADIX  SORTING  AND 
CONFLICTS 


Some self-routing INs, called cube-type networks,
employ the so-called destination tag routing
scheme for realizing any passable permutation
through them [7]. In this scheme, the destination
address of the $i$th input, $0 \le i \le N-1$, is used as the
routing tag for the $i$th input, and a 2x2 switch at
stage $k$, for $1 \le k \le n$, examines the $k$th bit of the
destination address of the incoming input: if the $k$th bit
is 0, then the upper output of the switch is taken;
otherwise, the lower output is taken. In some cube-type
networks such as the baseline, the way the destination
tag routing scheme works is closely related
to radix sorting. To see this, let $S$ be the set of the
binary representations of $N$ numbers from 0 to $N-1$
in some order. Radix sorting sorts the elements of $S$
recursively as follows: in the first iteration, the set
$S$ is partitioned into two subsets, say $S_1$ and $S_2$,
such that the most significant bits of all the
numbers of $S_j$, $j \in \{1, 2\}$, are the same. In the second
iteration, the set $S_j$ is partitioned into two subsets
such that the second most significant bits of all the
numbers of each subset are the same. The partitioning
process of a set into two subsets of equal size is
iterated for the third most significant bit, the fourth
most significant bit, and finally the least significant
bit. Now, let us consider the realization of a passable
permutation through the baseline network
using the destination tag routing scheme. The first
stage partitions the given set of $N$ numbers into two
subsets such that the most significant bits of the
destination addresses of all the inputs at the upper
(respectively, lower) half of the second stage are 0
(respectively, 1). Similar to the iterations of the
radix sorting, the $k$th stage of the baseline network
partitions every incoming subset of the inputs into
two subsets with respect to the $k$th most significant
bits of the destination addresses of their inputs.
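The radix-sorting view of destination tag routing can be sketched in a few lines. This is only an illustration of the partitioning described above, with a hypothetical function name; it models the successive splits on the most significant bits and ignores switch conflicts and the physical wiring of the baseline network.

```python
def radix_partition(tags, n_bits):
    """Partition destination tags on successive most significant bits.

    After the k-th pass every group shares its first k bits, mirroring the
    subsets formed by the k-th stage; after n_bits passes the tags are sorted.
    """
    groups = [list(tags)]
    for k in range(n_bits):
        shift = n_bits - 1 - k                       # position of the k-th MSB
        next_groups = []
        for group in groups:
            zeros = [t for t in group if ((t >> shift) & 1) == 0]
            ones = [t for t in group if ((t >> shift) & 1) == 1]
            next_groups.extend([zeros, ones])        # upper half, then lower half
        groups = next_groups
    return [t for group in groups for t in group]


print(radix_partition([5, 2, 7, 0, 3, 6, 1, 4], 3))
```

For a permutation of the tags 0 to $N-1$, each pass splits every group into two halves of equal size, which is the property the concentrators in the later sections rely on.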

Cube-type networks pass a small fraction of
$N!$ permutations because, for some permutations,
"conflicts" occur in the switches when the $k$th bits
of the destination addresses of the 2 inputs constitute
either the set {0,0} or the set {1,1}. The networks
proposed in this paper use 2x4 switches to
prevent conflicts and are based on the idea of radix
sorting. This implementation is different from that
of Jan and Oruc [9] in many respects, although both
implementations have some common points. The
network proposed by Jan and Oruc first demultiplexes
each input onto all $N$ outputs of a distributor
through a binary tree of $N-1$ vertices, then the
second half of the network concentrates these inputs
to their appropriate outputs. The approach of this
paper replaces the 2x2 switches by the 2x4 switches
to enable the 2 inputs of every switch to go to the next
stage simultaneously even if their control bits constitute
the set {0,0} or {1,1}. But, no matter what
the 2 inputs of a 2x4 switch are, two of the four outputs of
any switch carry invalid messages to the next stage.
In order to detect and discard these invalid messages
before they arrive at the switches of the next stage,
concentrators are provided between stages.

3.  SELF-ROUTING  PERMUTATION 
NETWORK 

If a network sets its switches using a distributed
routing scheme in which each switch determines
its own setting dynamically by examining the
routing tag bits of the inputs, then this network is
said to be self-routing. In this section, a self-routing
permutation network constructed from concentrators
and digit-controlled 2x4 switches is presented.
First, the definitions of concentrator and hyperconcentrator
are given below.

An $(N, M)$ concentrator is an IN that has $N$
inputs $X_1, X_2, \ldots, X_N$ and $M \le N$ outputs
$Y_1, Y_2, \ldots, Y_M$ and can establish $M$ disjoint paths
from any set of $M$ inputs to the $M$ outputs [6]. An
IN with $N$ inputs and $N$ outputs is called an
$N$-hyperconcentrator if there are $R$ disjoint paths, for
any $1 \le R \le N$, from each set of $R$ inputs to the first
$R$ outputs $Y_1, Y_2, \ldots, Y_R$. This implies that any
$N$-hyperconcentrator can be used as an $(N, M)$ concentrator
by simply choosing the first $M$ outputs of the
hyperconcentrator as the $M$ outputs of the concentrator.
In this paper, the hyperconcentrator of Cormen
and Leiserson [6] is used as a concentrator.

3.1.  Network  Configuration 

The first self-routing permutation network
introduced in this paper is called PN and can be
constructed recursively. The first iteration of the
recursive construction, shown in Figure 1, is composed
of the input and output stages. The input
stage consists of $N/2$ digit-controlled 2x4 switches
and two $(N, N/2)$-concentrators. The output stage
consists of two copies of an $N/2$-input PN in parallel.
The PNs with fewer inputs/outputs are decomposed
recursively until $N/2$ switches of size 2x2 are
obtained in the last iteration. (Note that at the last
stage of PN, there is no need to employ 2x4
switches because the outputs of a switch are connected
to only 2 outputs of PN.) This implies that a
PN with $N$ inputs/outputs can be constructed recursively
in $\log N$ iterations. Thus, PN is composed of
$\log N$ stages labeled from left to right starting with
1. For $1 \le k \le n-1$, stage $k$ consists of $2^{n-1}$ 2x4
switches followed by $2^k$ $(2^{n-k+1}, 2^{n-k})$ concentrators
numbered from 0 to $2^k - 1$. Stage $n$ consists of
$2^{n-1}$ 2x2 switches. As an example, Figure 2 illustrates
PN with $N = 8$ inputs/outputs.
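The stage-by-stage composition of PN can be tabulated directly from the counts given above. The helper below is a hypothetical bookkeeping sketch, not part of the paper; it only restates the formulas for the number of switches and the number and size of concentrators per stage.

```python
def pn_stages(n):
    """Describe each stage of a PN with N = 2**n inputs/outputs."""
    stages = []
    for k in range(1, n):
        stages.append({
            "stage": k,
            "switches": 2 ** (n - 1),                 # 2x4 switches
            "switch_size": "2x4",
            "concentrators": 2 ** k,
            "concentrator_size": (2 ** (n - k + 1), 2 ** (n - k)),
        })
    stages.append({
        "stage": n,
        "switches": 2 ** (n - 1),                     # last stage uses 2x2 switches
        "switch_size": "2x2",
        "concentrators": 0,
        "concentrator_size": None,
    })
    return stages


for stage in pn_stages(3):                            # N = 8, as in Figure 2
    print(stage)
```

A quick consistency check is that, at every stage before the last, the concentrator inputs add up to $2^{n+1}$, which is four outputs for each of the $2^{n-1}$ switches.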

3.2.  Routing  Scheme 

All the inputs routed by PN to their destinations
are bit-serial data streams in a packet format.
Each packet is divided into header and data sections;
the header contains the destination address of
the packet and a single "valid bit", which is "1" if
the packet contains a valid message and "0" if it contains
an invalid message. Those 2 outputs of any 2x4
switch which are not connected to any of its inputs
contain invalid messages. To inform the
concentrator(s) that they carry invalid messages, the
switch appends a valid bit of 0 to the header of each
invalid message. On the other hand, the valid bits
of the other 2 outputs that are connected to the
incoming inputs are made 1. These valid bits of the
inputs are used while setting up the conducting paths of
a concentrator, so that a valid message arrives at an
output of the concentrator, as opposed to an invalid
message, which is not routed to any output of the concentrator.
Also, the concentrators used in PN
require all $n$ bits of each invalid message to be 0's.

In PN, the destination tag routing scheme is
employed to realize any permutation through it.
The switches and the concentrators of PN are “self-setting”
when inputs are presented to them. No
matter what the inputs are, the switches never have
a conflict because both inputs can be routed to the
same concentrator. For 1 ≤ k ≤ n-1, if the kth
most significant bit of the destination tag of the
upper input to a switch at the kth stage is 0, then
the upper input of the switch is routed to the first
(topmost) output; otherwise, the upper input is
routed to the third output. If the kth most
significant bit of the destination tag of the lower
input to a switch at the kth stage is 0, then the
lower input of the switch is routed to the second
output from the top; otherwise, the lower input is
routed to the fourth output. If both kth bits of the
inputs are 0's, then both inputs are routed to the
upper two outputs, called upper-connection; if both
kth bits are 1's, then they are directed to the lower
two outputs, called lower-connection. The input
links of a 2x4 switch are labeled 0 and 1, while
the output links are numbered 0, 0, 1, and 1 from
top to bottom (see Figure 3 for an illustration of
the four states of a 2x4 switch).
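The switch-setting rule above can be sketched in a few lines of Python. This is only an illustrative model: the function name `route_2x4`, the use of `None` to stand for an invalid message (valid bit 0), and the 1-based stage index `k` are assumptions of the sketch, not notation from the paper.

```python
def route_2x4(upper_tag, lower_tag, k, n):
    """Model one 2x4 switch at stage k (1-based) for n-bit destination tags.

    Returns the four outputs from top to bottom. None marks an output that
    carries an invalid message (an assumption of this sketch).
    """
    def kth_msb(tag):
        # kth most significant bit of an n-bit tag
        return (tag >> (n - k)) & 1

    outputs = [None, None, None, None]
    # Upper input: first output if its kth bit is 0, otherwise third output.
    outputs[0 if kth_msb(upper_tag) == 0 else 2] = upper_tag
    # Lower input: second output if its kth bit is 0, otherwise fourth output.
    outputs[1 if kth_msb(lower_tag) == 0 else 3] = lower_tag
    return outputs
```

Because the two inputs can never claim the same output link, the four states (straight, cross, upper-connection, lower-connection) all fall out of the same two assignments.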

3.3.  Concentrators  Used  in  the  PN 

The PN uses the M-hyperconcentrator,
M = 2^m and 2 ≤ m ≤ n, of Cormen and Leiserson [6]
as an (M, M/2) concentrator by simply choosing the
first M/2 outputs of the hyperconcentrator as the
M/2 outputs of the (M, M/2) concentrator. Because
an (M, M/2) concentrator in PN has exactly M/2
valid and M/2 invalid messages at its inputs, every
output of the concentrator carries a valid message
to the next stage. Their M-hyperconcentrator
routes all the valid messages to the first outputs and
does not route any of the invalid messages at all.
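A purely functional view of this concentrator can be sketched as follows. The names `hyperconcentrate` and `concentrate_half` are hypothetical, `None` stands for an invalid message, and the relative order in which valid messages appear at the outputs is an assumption of the sketch; the text only states that valid messages occupy the first outputs.

```python
def hyperconcentrate(messages):
    """Route all valid messages to the first outputs; invalid ones are dropped."""
    valid = [m for m in messages if m is not None]
    # Remaining outputs carry no message.
    return valid + [None] * (len(messages) - len(valid))


def concentrate_half(messages):
    """(M, M/2) concentrator: keep only the first M/2 hyperconcentrator outputs."""
    return hyperconcentrate(messages)[: len(messages) // 2]
```

When exactly half of the inputs are valid, as in PN, every retained output carries a valid message.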

Theorem 1. The network shown in Figure 1
is a self-routing permutation network.

Proof. Each of the 2x4 switches at the first
stage of Figure 1 examines the most significant bits
of the destination addresses of its inputs to set itself
to one of the four states shown in Figure 3. Because
a 2x4 switch is connected to each of the two concentrators
through 2 links, no conflict occurs in the
switches. So, a 2x4 switch routes the input whose
destination address has a most significant bit equal to 0
(respectively, 1) to the upper concentrator (respectively,
the lower concentrator).

Because the realization of permutations on N
integers through the network is considered, the most
significant bits of N/2 of the destination addresses are
1 and the most significant bits of the other half are 0.
Therefore, each of the two (N, N/2) concentrators
has N/2 valid and N/2 invalid messages. These
concentrators route only their valid messages to
their outputs. Finally, each of the 2 permutation
networks at the last stage of Figure 1 routes its
inputs to their destinations. Therefore, the network
shown in Figure 1 is rearrangeable and, hence, is a
permutation network. Because the concentrators
are self-routing and the switches set
themselves “online” by examining the destination
addresses of the inputs, the network illustrated in
Figure 1 is also self-routing. □
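The argument of the proof can be exercised with a small functional simulation. The sketch below is not the hardware design: the function names are hypothetical, inputs 2i and 2i+1 are assumed to feed switch i, the upper two outputs of every switch are assumed to feed the upper concentrator, and the recursion is carried down to size 1 instead of modelling the last stage separately.

```python
import random


def route_pn(tags, bit):
    """Route destination tags through a recursive PN model.

    tags: tags entering a (sub)network of size len(tags).
    bit:  index of the tag bit examined by this stage (0 = least significant).
    """
    size = len(tags)
    if size == 1:
        return tags
    upper_in, lower_in = [], []  # inputs of the two concentrators
    for i in range(0, size, 2):
        out = [None] * 4  # None marks an invalid message
        up, low = tags[i], tags[i + 1]
        out[0 if (up >> bit) & 1 == 0 else 2] = up
        out[1 if (low >> bit) & 1 == 0 else 3] = low
        upper_in += out[:2]
        lower_in += out[2:]
    # Each concentrator forwards only its valid messages.
    upper = [t for t in upper_in if t is not None]
    lower = [t for t in lower_in if t is not None]
    assert len(upper) == len(lower) == size // 2
    return route_pn(upper, bit - 1) + route_pn(lower, bit - 1)


def route_permutation(perm):
    """Route a permutation of 0..N-1 (N a power of two) through the model."""
    n = len(perm).bit_length() - 1
    return route_pn(list(perm), n - 1)


perm = random.sample(range(8), 8)
print(perm, "->", route_permutation(perm))
```

For any permutation, the message with destination tag d ends up at output position d, which is the behaviour Theorem 1 asserts.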

3.4.  Performance  Analysis 

In this section, the hardware cost and the
propagation delay of the PN are computed. In
deriving the computations, the propagation delay of
any switch used in PN will be taken as 1. In other
words, the propagation delay of a digit-controlled
2x4 switch is the basic unit of time. Let A_PN and
D_PN denote the hardware cost and the propagation
delay of PN, respectively.


3.4.1. The Propagation Delay, D_PN

Because PN can be constructed recursively, its
propagation delay can be described with the help of
Figure 1 by the following recurrence equation:

$$D_{PN}(N) = D_{PN}(N/2) + D_C(N, N/2) + D_S \tag{3.1}$$

where D_PN(N) is the propagation delay of PN with
N inputs/outputs, D_C(N, N/2) is the propagation
delay of an (N, N/2) concentrator, and D_S is the
delay of a 2x4 switch, which equals 1. Because PN
uses the N-hyperconcentrator of Cormen and Leiserson
[6] as an (N, N/2) concentrator, D_C(N, N/2) has
2 log N gate-level delay [6], i.e.,
D_C(N, N/2) = 2 log N. So, equation (3.1) can be
rewritten as

$$D_{PN}(N) = D_{PN}(N/2) + 2\log N + 1 \tag{3.2}$$

$$D_{PN}(N) = \log^2 N + 2\log N - 2 \tag{3.3}$$

Hence, realizing a permutation through PN takes
O(log^2 N) gate-level delay.
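The step from the recurrence (3.2) to the closed form (3.3) can be checked numerically. The sketch below assumes a base case of D_PN(2) = 1 (a single switch delay), which is not stated explicitly in the text but is the value for which (3.3) satisfies the recurrence; the function names are hypothetical.

```python
def pn_delay(size):
    """Evaluate recurrence (3.2) for a network with `size` = 2**n inputs."""
    if size == 2:
        return 1  # assumed base case: one switch delay
    log_n = size.bit_length() - 1  # log2(size) for a power of two
    return pn_delay(size // 2) + 2 * log_n + 1


def pn_delay_closed_form(size):
    """Closed form (3.3): log^2 N + 2 log N - 2."""
    n = size.bit_length() - 1
    return n * n + 2 * n - 2


for n in range(1, 11):
    assert pn_delay(2 ** n) == pn_delay_closed_form(2 ** n)
```

Under that assumption the two expressions agree for every size checked.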

3.4.2. The Hardware Cost, A_PN

The cost is given by the recurrence:

$$A_{PN}(N) = 2A_{PN}(N/2) + 2A_C(N, N/2) + A_{st} \tag{3.4}$$

where A_PN(N) is the cost of PN with N
inputs/outputs, A_C(N, N/2) is the cost of an
(N, N/2) concentrator, and A_st is the cost of N/2
switches and all the interconnection links in a stage
of the PN. Because the N-hyperconcentrator is used
as an (N, N/2) concentrator, A_C(N, N/2) equals the
cost of the N-hyperconcentrator.
Because the N-hyperconcentrator of Cormen and
Leiserson uses Θ(N^2) components and has area
Θ(N^2), A_C(N, N/2) = N^2.

According to [11], A_st for a shuffle-exchange
stage is O(N^2/log^2 N) and O(N^2/log N), respectively.
Therefore, it is assumed that A_st = O(N^2).
When these values are substituted in (3.4), we
obtain

$$A_{PN}(N) = 2A_{PN}(N/2) + 2N^2 + N^2 \tag{3.5}$$

$$A_{PN}(N) = N/2 + 3N^2(n-1) - \tfrac{3}{2}N^2 + 3N \tag{3.6}$$

Equation (3.6) shows that PN uses O(N^2 log N)
VLSI-level hardware. Table 1 compares the propagation
delay and cost of PN with those of the networks
of Batcher [3] and Jan and Oruc [9].
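The recurrence (3.5) can also be evaluated numerically to confirm that the cost stays within the stated O(N^2 log N) bound. This is only a sanity check on the bound: the base cost A_PN(2) = 1 and the constant 3 used in the comparison are assumptions of the sketch, and O-notation hides the exact constants.

```python
def pn_cost(size):
    """Evaluate recurrence (3.5) for a network with `size` = 2**n inputs."""
    if size == 2:
        return 1  # assumed base cost of the smallest network
    return 2 * pn_cost(size // 2) + 3 * size * size


for n in range(1, 16):
    size = 2 ** n
    # The cost never exceeds 3 * N^2 * log2(N) for the sizes checked.
    assert pn_cost(size) <= 3 * size * size * n
```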

4.  THREE  SELF-ROUTING 
PERMUTATION  NETWORKS 

First network: The network PN presented in
Section 3 employs the N-hyperconcentrator introduced
in [6] at the first stage. This N-hyperconcentrator
has log N stages of merge boxes,
each of which routes all its valid messages to the
first outputs. The last merge box at the last stage
of the N-hyperconcentrator has N inputs and N outputs.
Therefore, when N gets larger, implementing
the merge boxes at the last stages of the
N-hyperconcentrator may require multiple chips. To
avoid having large merge boxes, the short-circuit
switches introduced in [8] can be used at some of the
first stages of the network. If these switches are
used only in the first few stages of PN, the propagation
time remains O(log^2 N) gate-level delay
because the extra delay of these short-circuit
switches is a small constant.

Second network: This network, called MPN
and shown in Figure 4 for N = 16, consists of an
n-stage network followed by a concentrator C(2N, N)
whose ith output, 0 ≤ i ≤ N-1, is connected back to
the ith input. This single (2N, N)-concentrator can
also be replaced by two (N, N/2) concentrators in
parallel. All the switches are digit-controlled, but
the switches at stages 2 through n are 4x4, while all
the switches at the first stage are 2x4.

Routing Scheme of MPN. The outputs of the
concentrator are recycled back n-1 times as
inputs, so that the given inputs pass through MPN
n times. During the kth pass, 1 ≤ k ≤ n, of the inputs
through the MPN, the switches are classified into 3
groups as follows: (1) if a switch belongs to a stage
whose label is smaller than k, then the switch connects
only its upper 2 inputs to its upper 2 outputs
and ignores its lower 2 inputs; (2) if a switch
belongs to the kth stage, then it determines its state
by inspecting the kth bit of the destination tags of
its upper 2 inputs in the same way as a switch of
PN does (the lower 2 inputs of the switch are
ignored); and (3) if a switch belongs to a stage
whose label is greater than k, then the switch connects
its pth input, 1 ≤ p ≤ 4, to its pth output.
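The three-way classification of switches during a pass can be written down directly. In this sketch the function name and the mode labels are hypothetical; only the comparison of the stage label with the pass number k comes from the text.

```python
def mpn_switch_mode(stage, k):
    """Return the behaviour of a switch at `stage` during the kth pass."""
    if stage < k:
        # Group (1): upper 2 inputs go to upper 2 outputs, lower inputs ignored.
        return "pass-upper"
    if stage == k:
        # Group (2): set by the kth bit of the tags on the upper 2 inputs.
        return "route-by-tag"
    # Group (3): pth input connected to pth output.
    return "straight"


n = 3
for k in range(1, n + 1):
    print(k, [mpn_switch_mode(stage, k) for stage in range(1, n + 1)])
```

In every pass exactly one stage performs tag-based routing, so n passes consume the n tag bits one at a time.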

Third network: This network, called APN and
illustrated in Figure 5 for N = 8, is obtained by
adding intrastage links among the switches of PN, and
multiplexers and demultiplexers at the first stage
and the last stage. Each 2x4 switch of PN is
replaced by a 3x5 switch, shown in Figure 5, which
contains 2 additional (auxiliary) links called the chain-in
link and the chain-out link [5]. The chain-in link and
the chain-out link are added to the top and the
bottom of a switch, respectively. All N/2 switches
at stage 1 form a single loop by making use of these
auxiliary links. All those switches at stage k,
1 < k ≤ n-1, which receive their inputs from the
same concentrator are connected to each other by
the auxiliary links to form a loop. The switches at
stage n do not need these auxiliary links because the
outputs of any concentrator at this stage are connected
to only a single switch. After adding intrastage
links among the switches of the PN, multiplexers
and demultiplexers are also added in the
first stage and the last stage, respectively, to make
the network 1-fault-tolerant. Because of space limitations,
the proof that this network is 1-fault-tolerant
is not included here and can be found in
[12].

APN  is  a  highly  robust  network  in  terms  of  its 
ability  to  tolerate  multiple  switch  and  link  faults 
simultaneously.  When  a  switch  has  a  fault,  the 
input  on  the  chain-in  link  bypasses  the  switch  and 
goes  to  the  next  switch  in  the  loop  via  the  chain-out 
link. 


5.  CONCLUSIONS 

Four different self-routing permutation networks
have been presented. One of these self-routing
permutation networks, called PN, is constructed
from 2x4 switches and concentrators of 2 log N
gate-level delay. The PN uses the N-hyperconcentrator
introduced in [6] as an (N, N/2)-concentrator. The
PN has O(log^2 N) gate-level delay and uses
O(N^2 log N) VLSI-level area. The propagation time
of this network is less than that of any self-routing
permutation network proposed in the literature. In
order to make PN 1-fault-tolerant, intrastage links
among switches of every stage, along with multiplexers
and demultiplexers, are added. By
recycling the inputs back n-1 times, a
self-routing permutation network is obtained from
an n-stage IN and a concentrator with 2N inputs
and N outputs. The problem of not being able to
implement a concentrator in a single chip when N
gets larger has been resolved using short-circuit
switches with queues.

REFERENCES 

[1] D.M. Koppelman and A.Y. Oruc, “A self-routing
permutation network,” J. of Parallel
and Distrib. Comput., 10, 1990, pp. 140-151.

[2] D.E. Knuth, The Art of Computer Programming,
Vol. 3: Sorting and Searching,
Addison-Wesley, 1973.

[3] K. Batcher, “Sorting networks and their
applications,” in AFIPS Spring Joint Computer
Conf., vol. 32, 1968, pp. 307-314.

[4] M. Ajtai, J. Komlos, and E. Szemeredi, “An
O(n log n) sorting network,” in Proc. 15th
Annu. ACM Symp. Theory Comput., 1983, pp.
1-9.

[5] N. Tzeng, P. Yew and C. Zhu, “A fault-tolerant
scheme for multistage interconnection
networks,” 12th Intl. Symp. Computer Architecture,
June 1985, pp. 368-375.

[6] T.H. Cormen and C.E. Leiserson, “A hyperconcentrator
switch for routing bit-serial messages,”
Proc. of the 1986 Int'l Conf. on Parallel
Processing, 1986, pp. 721-728.

[7] D.H. Lawrie, “Access and alignment of data in
an array processor,” IEEE Trans. Comput.,
vol. C-24, no. 12, Dec. 1975, pp. 1145-1155.

[8] Y. Pan and R. Melhem, “Short circuits in
buffered multi-stage interconnection networks,”
The Computer Journal, vol. 33, No. 4,
1990, pp. 323-329.

[9] C. Jan and A.Y. Oruc, “Fast self-routing permutation
networks,” 1991 Int'l Conf. on
Parallel Processing, pp. 1-283 - 1-269.

[10] M.A. Franklin, “VLSI performance comparison
of banyan and crossbar communications
networks,” IEEE Trans. on Comput.,
vol. C-30, No. 4, Apr. 1981, pp. 283-291.

[11] F.T. Leighton, M. Lepley and G.L. Miller,
“Layouts for the shuffle-exchange graph based
on the complex plane diagram,” SIAM J. Alg.
Disc. Meth., vol. 5, No. 2, June 1984, pp.
202-215.

[12] Hasan Cam, “Design and permutation routing
algorithms of rearrangeable networks,” Ph.D.
Thesis, Purdue Univ., West Lafayette, 1992.


Table 1. Comparison of the performance of 3 self-routing
networks, where the delay and cost are computed
at gate level and VLSI level, respectively.

Network            Delay         Cost
Batcher's IN       O(log^3 N)    O(N^2 log N)
Jan & Oruc's IN    O(log^3 N)    O(N^2 log N)
PN                 O(log^2 N)    O(N^2 log N)

Figure 1. The first iteration of the recursive construction
of the self-routing permutation network
called PN.


Figure 2. The self-routing permutation network
(PN) with N = 8. The letter C in the boxes stands
for concentrator.


Figure 3. The 4 states of a switching box in PN.
(a) Straight connection when no conflict exists. (b)
Cross connection when no conflict exists. (c) Upper
connection when both control bits are 0. (d) Lower
connection when both control bits are 1.


Figure 4. The modified permutation network
MPN with N = 16 inputs/outputs.


Figure 5. An APN with N = 8. (Figure not reproduced;
the inputs/outputs are labeled 0 to 7 and the stages
are labeled 1, 2, 3.)


REFERENCE  NO.  18 


Saghi, G., Siegel, H. J., and Fortes, J. A. B., "On the Viability of a Quantitative
Model of System Reconfiguration Due to a Fault," 1992 International Conference on
Parallel Processing, August 17-21, St. Charles, Illinois, Vol. I, pp. 233-242.

Note - This paper proposes a quantitative model that can be used by an operating system
to decide what reconfiguration strategy should be used. It discusses how the
costs of three reconfiguration schemes may be quantifiable in a realistic setting.




ON  THE  VIABILITY  OF  A  QUANTITATIVE  MODEL 
OF  SYSTEM  RECONFIGURATION  DUE  TO  A  FAULT 

Gene Saghi, Howard Jay Siegel, and Jose A. B. Fortes

Parallel  Processing  Laboratory,  School  of  Electrical  Engineering 
Purdue  University,  West  Lafayette,  IN  47907-1283  USA 


Abstract - Dynamic fault reconfiguration has been
proposed for use in large-scale partitionable parallel
processing systems with distributed shared memory. If a
processor develops a permanent fault during the execution
of a task on a submachine A, three recovery options
are migration of the task to another submachine, task
migration to a subdivision of A, and redistribution of the
task among the fault-free processors in A to effectively
“disconnect” the faulty processor. Quantitative models
of these three reconfiguration schemes are developed to
consider what information is needed to make a choice
among these methods for a practical implementation. A
multistage cube or hypercube inter-processor network is
assumed. It will be pointed out that in certain situations
collecting precise values for all needed parameters will
be very difficult. Thus, the motivation for this research is
to examine to what extent a choice of reconfiguration
strategies is quantifiable in a realistic setting.

1.  INTRODUCTION 

If massively parallel processing systems are to provide
reliable operation over extended periods of time,
they must be capable of tolerating faults. A fault-tolerant
system must be able to detect and locate faults, to
reconfigure itself to “disconnect” and perhaps replace
faulty components, to recover from possibly erroneous
computations, and to restart operation from a correct
state. A dependable fault-tolerant system must be able to
both operate in the presence of faults and meet the performance
requirements of its applications.

One approach to achieving fault tolerance in parallel
processing systems involves the use of redundant
hardware. When a faulty component is detected, the system
is reconfigured in such a way that the faulty component
is replaced by one of the redundant components.
An example of such a system is MPP [2]. MPP is a 128 x
128 array of single-bit processors implemented as a 64 x
32 array of 2 row x 4 column VLSI ICs. A redundant
column of VLSI ICs (64 ICs) is provided to replace any
of the 32 columns of ICs within which a faulty processor
may exist. Thus, any single processor IC failure in the
array can be tolerated. Additional hardware could be
added, at additional cost, to allow a greater number of
faults to be tolerated. A general disadvantage of the
redundant hardware approach is that the extra hardware is
idle until a fault occurs.

This research was supported by the Office of Naval Research under
grant number N00014-90-J-1483 and supported by the Innovative
Science and Technology Office of the Strategic Defense Initiative Organization
and administered through the Office of Naval Research under contract
number N00014-88-k-0723.

The work presented here focuses on partitionable
parallel processing systems where the set of processors
can be partitioned to form multiple independent submachines.
The execution of a parallel program on a submachine
is defined as a task. It is possible to achieve fault
tolerance in such a system by utilizing the
reconfigurability of the system to effectively “disconnect”
the faulty component. For example, parallel processing
systems such as Intel Cube [15], nCUBE [13],
IBM RP3 [24], and PASM [29, 8] incorporate partitionable
interconnection networks and therefore have the
ability to migrate a task from a faulty submachine to a
fault-free submachine [e.g., 27]. Such an approach has
the advantage of no redundant hardware costs; however,
the total available system resources are decreased when a
fault occurs. Furthermore, if the smallest possible submachine
in such a system consists of more than one processor,
the fault-free processors in the faulty submachine
become idle. Thus, there is a trade-off between redundant
hardware cost and degraded system performance for the
fault tolerance schemes discussed.

The architecture assumed here implements a physically
distributed memory such that each processor is
paired with local memory to form a processing element
(PE). Most existing large-scale parallel processing systems
use a physically distributed memory approach (e.g.,
BBN Butterfly [3], Connection Machine CM-2 [32], Intel
Cube [15], nCUBE [13], DAP [14], MasPar [4], IBM
RP3 [24]). These systems implement either a logically
nonshared memory system, a logically shared memory
system, or a hybrid of the two memory systems. In a logically
nonshared memory system (e.g., CM-2, MasPar),
processors cannot access remote memory locations
directly. Instead, all communication between PEs is
through explicit message passing. In a logically shared
memory system (e.g., BBN Butterfly), all system memory
appears in the address space of each processor. Accesses
to memory locations located in a remote processor's
memory require use of the interconnection network. As
a result, remote memory accesses incur a larger latency
than local memory references. Careful placement of program
code and data is one way to reduce the effects of
this network latency. In one type of hybrid memory system
(e.g., IBM RP3), a portion of each processor's
memory is reserved for nonshared access, the remainder
is treated as logically shared memory, and all interprocessor
communication is through the shared memory. The
work in this paper assumes a logically shared or hybrid
memory system.

Previous research on fault recovery by dynamic
reconfiguration includes [33] and [34]. These papers
explored task redistribution in a MIMD environment to
recover from a PE fault for near-neighbor-class problems.
Other work has considered the reconfiguration of hypercube
architectures in the event of PE failure [12, 20].
Here, quantitative models of three different
reconfiguration schemes are developed to consider what
information is needed to make a choice among these
methods for practical implementations, such as mission
critical situations (e.g., automated defense systems). It
will be pointed out that, given today's technology, in certain
circumstances collecting precise values for all
needed parameters will be very difficult. Thus, the
motivation for this research is to examine to what extent a
choice of reconfiguration strategies is quantifiable in a
realistic setting.

The system and fault models used to analyze fault-recovery
options are described in Section 2. Section 3
presents recovery options and associated costs. One cost
that must be considered for every fault recovery option is
the remaining execution time of the task that was on the
faulty submachine when the fault occurred. As an example
of the difficulties in determining some of these costs,
Sections 4 and 5 provide models for determining the
remaining execution time of a task depending on the system
configuration and operation modes. In Section 4,
tasks whose execution times are data independent are
considered, while in Section 5 tasks whose execution
times are data dependent are considered.

2.  SYSTEM  MODEL 

Two approaches to massively parallel computation
are the SIMD and MIMD modes of parallelism [9]. In
machines that operate in MIMD mode, the PEs contain
programs and data, and each PE functions independently.
In machines that operate in SIMD mode, instructions are
broadcast to the PEs, the PEs contain only data, and all
the PEs operate in lock-step fashion. A multiple-SIMD
machine is an SIMD machine that can operate as one or
more independent SIMD machines [22]. A mixed-mode
machine can operate in either the SIMD or MIMD mode
of parallelism and can switch modes at instruction-level
granularity [8]. A partitionable SIMD/MIMD machine
can operate as one or more independent or cooperating
submachines, where each submachine may operate as a
mixed-mode machine [30].

The analyses here can be used to model MIMD,
multiple-SIMD, or partitionable SIMD/MIMD parallel
processing systems, utilizing a multistage cube or hypercube
interconnection network, and possessing a logically
shared or hybrid memory system. The research assumes
a partitionable SIMD/MIMD machine with a multistage
cube network and can be directly applied to the other
cases.


It is assumed that overall system activities are supervised
by a dedicated processor known as the system controller
unit (SCU), although it could be a program distributed
among system processors. Among other duties, the
SCU is responsible for allocating and deallocating submachines,
and for determining the proper recovery action
in the event of a detected fault. The activities in each submachine
are supervised by a submachine controller (SC).
Similar to the SCU, the SC is assumed to be a designated
processor, although it could be a program distributed
among the processors of the submachine.

The execution of an SIMD procedure on a submachine
is an SIMD process. An MIMD process is the
execution of an MIMD procedure on a PE that is part of
the submachine. The term subtask refers to a single
thread of control (stream of instructions). A subtask can
be composed of one or more SIMD processes executed
sequentially on the same submachine, or one or more
MIMD processes executed sequentially on the same submachine
PE. A task can be composed of one or more subtasks.
Tasks are coded assuming that any number of virtual
processors required is available. At execution time,
the SC determines the number of physical processors
available in the submachine and maps the virtual processors
onto physical processors. The ratio of virtual processors
to physical processors is called the
virtual processor ratio. When the virtual processor ratio
is greater than one, some or all of the physical processors
will perform the functions of more than one virtual processor.
If the virtual processor ratio is less than one, some
physical processors will remain idle. Use of a virtual processor
scheme allows tasks to be executed on submachines
of various sizes without having to be recompiled.
The Connection Machine CM-2 [32] uses such
a virtual processor scheme to allow programs to be executed
on machines consisting of different numbers of
physical processors. Other schemes that allow for the
same kind of functionality are equally applicable to this
research.
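The virtual processor ratio and one possible mapping can be illustrated with a short sketch. The round-robin assignment and both function names are assumptions made for illustration; the paper leaves the mapping to the implementation of virtual processors in a given system.

```python
from fractions import Fraction


def virtual_processor_ratio(num_virtual, num_physical):
    """Ratio of virtual processors to physical processors."""
    return Fraction(num_virtual, num_physical)


def map_virtual_to_physical(num_virtual, num_physical):
    """Hypothetical round-robin mapping of virtual onto physical processors."""
    mapping = {p: [] for p in range(num_physical)}
    for v in range(num_virtual):
        mapping[v % num_physical].append(v)
    return mapping


# Ratio greater than one: each physical processor hosts several virtual ones.
print(virtual_processor_ratio(16, 8), map_virtual_to_physical(16, 8)[0])
# Ratio less than one: some physical processors remain idle.
print(virtual_processor_ratio(3, 4), map_virtual_to_physical(3, 4)[3])
```

Halving the number of physical processors while keeping the task unchanged doubles the ratio, which is the situation that arises when a task is moved to half of its original submachine.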

The model used for fault tolerance and recovery is as
follows. At regular intervals during the execution of a
task, the intermediate results of each PE are stored in a
different PE within the same submachine (i.e., a permutation
of the PEs in the submachine). These results are
called checkpoint data and they will be used to restore a
valid system state in the event of a fault. Error-recovery
techniques using checkpointing are discussed in [10]. It
is assumed that one or more existing fault detection techniques
in the literature (e.g., [7] or [1]) are used in the system
to detect faulty components. The faults of interest are
permanent faults that affect the processor of the PE or the
memory module of the PE. Transient faults are not considered
in this paper. For this study it is assumed that a
faulty processor is unable to either compute or communicate
with other PEs, but does not interfere with the operation
of fault-free PEs. Furthermore, the local memory of a
faulty processor is assumed to be corrupt or inaccessible.
The failure of a processor local data bus or arithmetic
logic unit would be examples of faults classified as processor
faults. When a memory module is faulty, it is
assumed that neither the local processor nor any other PE
can reliably read data from or write data to the
faulty memory module. The local processor is assumed to
be fully operational in every other way. Such would be
the case if there was a failure in the memory module
refresh circuitry or a memory module I/O buffer, for
example. A PE with a faulty processor or a faulty memory
module will henceforth be referred to as a faulty PE,
unless it is necessary to distinguish between the two fault
types. When a fault is detected, the SCU determines and
directs the proper recovery action, after which processing
continues from the last valid checkpoint.

3.  RECOVERY  OPTIONS  AND  COSTS 

When a permanent fault occurs in a submachine A,
three possible reconfiguration/recovery options are as follows.

1)  Subdivide  A  into  two  equal-size  system  submachines, 
and  use  the  one  that  is  fault-free  to  complete  the  exe¬ 
cution  of  the  task. 

2)  Migrate  the  task  to  another  submachine  that  is  fault- 
free. 

3) Redistribute the task programs and data among the
fault-free PEs in A and complete the task using a
modified algorithm that does not use the faulty PE.

These  recovery  options  are  discussed  further  in  the  fol¬ 
lowing  subsections. 

If  a  PE  memory  module  fault  occurs,  another 
reconfiguration/recovery  option  is  possible.  Using  infor¬ 
mation  from  secondary  storage  or  checkpoint  data,  the 
process  that  was  executing  on  that  PE  can  be  loaded  into 
the  memory  modules  of  other  PEs  in  submachine  A.  Exe¬ 
cution  then  continues  as  before,  but  with  the  processor 
associated  with  the  faulty  memory  module  accessing  only 
remote  memory.  This  option  was  considered  in  the 
research  described  here,  but  could  not  be  included 
because  of  space  limitations.  Results  of  that  work  can  be 
found  in  [25]. 

In addition to the costs of recovery discussed for
each option, there is a time overhead of determining these
recovery costs to select the best option for a given situation.
However, this overhead is incurred prior to the initiation
of any of these recovery schemes, so it is separate
from the cost of recovery and is not included in the following
subsections.

3.1.  Task  Completion  on  a  Fault-Free  Subdivision 

When  a  PE  fault  occurs  on  a  dynamically  partition- 
able  system,  it  can  be  avoided  by  subdividing  the  current 
submachine  and  completing  the  task  on  the  fault- free  sub¬ 
division.  This  process  would  proceed  as  follows.  Once 
the  PE  fault  has  been  located  and  the  recovery  point  for 
the  task  has  been  determined,  the  SCU  must  make  a  deter¬ 
mination  about  the  partitionability  of  the  current  sub¬ 
machine.  For  this  option,  it  is  assumed  that  algorithmic 


constraints  dictate  that  the  subdivision  is  required  »  pos¬ 
sess  the  same  topological  network  properties  as  the  origi¬ 
nal  submachine.  The  recovery  option  of  Section  3.3  is 
used  when  there  is  no  need  to  preserve  these  proper  ties. 
To  create  independent  submachines  with  the  same  topo¬ 
logical  network  properties  as  the  original  network,  all 
submachines  in  a  system  possessing  a  multistage  cube  or 
hypercube  interconnection  network  must  have  sizes  that 
are a power of two [28, 30]. Thus, if a submachine can be
partitioned (i.e., it is not of minimum size), it can be
partitioned into two equal-size submachines. For a machine
with $N = 2^n$ PEs, numbered from 0 to $N-1$, the numbers
of all PEs in a submachine of size $K = 2^k$ agree in $n-k$ bit
positions. These $n-k$ bits are called submachine bits.
Without loss of generality, let these submachine bits be
the low-order $n-k$ bits. The fault-free subdivision (half) B
of this submachine A is composed of $K/2$ PEs. B's
submachine bits are calculated by appending the complement
of bit $n-k+1$ of the faulty PE number to the high-order
end of submachine A's submachine bits. Thus, determining
the submachine bits of the fault-free subdivision
(FFS) requires a constant amount of time, $T^{FFS}_{Choose}$.
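
The submachine-bit calculation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the SCU's actual procedure: the function names and the example machine are hypothetical, and the text's bit $n-k+1$ (counted from 1) is taken to be bit position $n-k$ when counting from 0.

```python
def fault_free_subdivision(sub_bits, n, k, faulty_pe):
    """Return (bits, width) of the submachine bits of the fault-free half B.

    sub_bits is the value of the low-order n-k bits shared by all PEs
    of submachine A, which has 2**k PEs in a machine of 2**n PEs.
    """
    width = n - k  # number of submachine bits of A
    if k < 1:
        raise ValueError("a submachine of one PE cannot be subdivided")
    if faulty_pe & ((1 << width) - 1) != sub_bits:
        raise ValueError("faulty PE is not in submachine A")
    # Bit n-k+1 counted from 1 is bit position n-k counted from 0.
    split_bit = (faulty_pe >> width) & 1
    # Append the complement of that bit at the high-order end.
    new_bits = ((split_bit ^ 1) << width) | sub_bits
    return new_bits, width + 1


def members(bits, width, n):
    """List the PE numbers whose low-order `width` bits equal `bits`."""
    mask = (1 << width) - 1
    return [pe for pe in range(1 << n) if pe & mask == bits]


# Hypothetical example: N = 16 PEs (n = 4); submachine A holds the
# eight odd-numbered PEs (k = 3, submachine bits = 1); PE 7 is faulty.
bits_b, width_b = fault_free_subdivision(sub_bits=0b1, n=4, k=3, faulty_pe=7)
print(members(bits_b, width_b, 4))
```

The work is a fixed number of shifts and masks regardless of machine size, which matches the constant-time claim. A real system would also have to respect its minimum submachine size, which this sketch does not model.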

The number of physical processors in the subdivision
is half the number of physical processors in the original
submachine. Therefore, the virtual processor ratio for
the subdivision will double. The SC must determine the
new mapping of virtual processors onto physical processors
and coordinate the relocation of program code and
data accordingly. $T^{FFS}_{Map}$, the time to perform this mapping,
is computable by the SC and will depend on the
implementation specifics of virtual processors in the system.

The next step is to move all the task program code
and data onto the fault-free subdivision as per the virtual
processor to physical processor mapping determined
above. This will require $T^{FFS}_{Transfer}$ time. The amount of code
a  task  consists  of  can  be  easily  determined  during  task 
compilation.  However,  the  amount  of  data  associated 
with  a  task  can  change  dynamically  during  task  execu¬ 
tion.  For  this  reason,  part  of  the  information  saved  during 
every  checkpoint  operation  is  the  amount  of  checkpoint 
data  saved.  Because  every  PE  stores  its  checkpoint  data 
into a different PE within the same submachine, no data is
lost  when  a  PE  becomes  faulty.  It  is  assumed  that  the 
interconnection  network  is  used  to  transfer  data.  This 
transfer  can  be  accomplished  without  conflicting  with 
inter-PE  messages  from  other  submachines  because  of 
the  partitioning  properties  of  the  network.  The  time  to 
accomplish  the  data  transfer  can  therefore  be  determined 
and  will  depend  on  the  amount  of  data  and  the  virtual  pro¬ 
cessor  to  physical  processor  mapping. 

In the SPMD (Single Program - Multiple Data) restriction
of MIMD mode [7], every PE in the submachine
has a copy of the same program. Thus, in SPMD mode,
program code does not have to be transferred and $T^{FFS}_{Transfer}$
will consist only of the time required to transfer the data.
However,  for  SIMD  and  MIMD  modes,  some  program 
code  will  have  to  be  transferred. 




1992 International Conference on Parallel Processing


The  structure  of  SC/PC  and  inter-SC  connections  (if 
they  exist)  varies  among  approaches  to  multiple-SIMD 
architectures  (e.g.,  CM-2  [32],  MAP  [22],  PASM  [29], 
TRAC [19]). Because of this, the time to do any needed
transfer of SIMD programs when “moving” a SIMD task
from one set of PEs to a subset or different set is highly
machine dependent. Thus, in the discussions that follow,
it  will  be  assumed  as  an  upper  bound  time  requirement 
that  the  SIMD  program  is  reloaded  from  secondary 
storage  whenever  a  reconfiguration  occurs. 

For  general  MIMD  mode  tasks,  program  code  will 
have  to  be  moved  from  PEs  not  in  the  subdivision  to  those 
that  are  in  the  subdivision.  Furthermore,  depending  on 
the  mapping  of  virtual  to  physical  processors,  program 
code  may  have  to  be  moved  among  PEs  in  the  subdivi¬ 
sion.  The  interconnection  network  is  used  to  perform  this 
transfer  of  program  code  and  the  time  to  do  this  can  be 
determined  in  the  same  way  as  for  the  transfer  of  data 
across  the  network.  However,  the  MIMD  programs  that 
resided  on  the  faulty  PE  will  have  to  be  loaded  from 
secondary  storage  (disk). 

The  time  to  transfer  a  SIMD  or  MIMD  program 
from  disk  depends  on  the  disk  latency,  the  amount  of  code 
to be transferred, and the bandwidth of the disk communication
channel. In such cases, $T^{FFS}_{Transfer}$ is difficult to predict
accurately because of variation in disk latency from one
access to another. Therefore, an expected time for disk
access will have to be used to determine $T^{FFS}_{Transfer}$.
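
As a rough illustration of the three factors just listed, the sketch below estimates the time to load program code from disk as an expected latency plus the code size divided by the channel bandwidth. The additive form, the function names, and all numeric values are assumptions made for illustration; the text only names the factors.

```python
import math


def mean_latency(latency_samples_s):
    """Average previously observed disk latencies (seconds)."""
    if not latency_samples_s:
        raise ValueError("at least one latency sample is required")
    return math.fsum(latency_samples_s) / len(latency_samples_s)


def expected_disk_transfer_time(code_bytes, expected_latency_s,
                                bandwidth_bytes_per_s):
    """Expected latency plus the time to stream the code over the channel."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return expected_latency_s + code_bytes / bandwidth_bytes_per_s


# Hypothetical values: 64 KiB of code, 1 MB/s channel, three latency samples.
latency = mean_latency([0.012, 0.020, 0.016])
print(expected_disk_transfer_time(64 * 1024, latency, 1_000_000))
```

Because the latency term is an average, the result is an expected time rather than a bound, which is why the transfer time remains hard to predict for any single reconfiguration.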

The SCU must then perform whatever operating system
functions are needed to establish the new system partitioning.
This requires a system dependent time
represented by $T^{FFS}_{Part}$. The time to complete the task
execution on the subdivision is $T^{FFS}_{CmpExec}$ and will generally
be greater than the time to complete the task execution on
the original submachine. The difficulty of determining
$T^{FFS}_{CmpExec}$ for this recovery option and those that follow is
discussed in Sections 4 and 5. $T^{FFS}_{Total}$, the total time to
reconfigure and complete a task on a subdivision of the
original submachine, is represented as follows.

$$T^{FFS}_{Total} = T^{FFS}_{Choose} + T^{FFS}_{Map} + T^{FFS}_{Transfer} + T^{FFS}_{Part} + T^{FFS}_{CmpExec}$$

There  are  three  main  disadvantages  of  the  subdivi¬ 
sion  recovery  method.  The  first  is  that  a  subdivision  of 
the  current  submachine  may  not  exist,  i.e.,  there  is  no 
way  to  further  subdivide  the  current  submachine.  This  is 
due  to  restrictions  on  minimum  submachine  size,  which 
is  usually  associated  with  SIMD  operation.  The  second 
disadvantage  is  similar  and  follows  from  this.  There  is  a 
limit  to  how  many  faults  can  be  tolerated  in  this  way 
because  of  the  physical  limits  to  the  number  of  times  a 
machine  can  be  partitioned.  Finally,  two  types  of  perfor¬ 
mance  degradation  may  occur.  The  first  is  task  perfor¬ 
mance  degradation.  A  task  will  generally  require  more 
time to complete on a subdivision of the original submachine.
This is further discussed in Section 5. The
second  type  of  performance  degradation  is  that  of  the 
system.  After  subdividing  a  submachine  because  of  a 
fault,  the  entire  subdivision  containing  the  faulty  PE 


becomes  idle.  In  MIMD  mode,  the  fault-free  PEs  of  this 
subdivision  can  be  utilized  by  the  system  because  the 
PEs  act  independently  of  one  another.  However,  in 
SIMD mode, the system must repeatedly partition the
subdivision containing the fault until the fault is in a
submachine of minimum size. Other tasks can be executed
on the fault-free submachines created during this partitioning
process. If the minimum size of the submachine
containing  the  fault  is  greater  than  one,  fault-free  PEs 
become  underutilized  and  thus  contribute  to  a  degrada¬ 
tion  in  system  performance. 
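
The repeated partitioning in SIMD mode can be sketched as follows. This is an illustrative model only: the function name, the representation of a submachine as a (bits, width) pair over the low-order PE address bits, and the example values are assumptions.

```python
def isolate_fault(sub_bits, width, n, faulty_pe, min_size):
    """Halve the submachine containing the fault until it is of minimum size.

    Returns (free, faulty): the fault-free submachines created along the
    way and the minimum-size submachine that still holds the faulty PE,
    each as a (bits, width) pair over the low-order address bits.
    """
    if faulty_pe & ((1 << width) - 1) != sub_bits:
        raise ValueError("faulty PE is not in the given submachine")
    free = []
    size = 1 << (n - width)
    while size > min_size:
        bit = (faulty_pe >> width) & 1
        # The half whose next bit differs from the faulty PE's is fault-free.
        free.append((((bit ^ 1) << width) | sub_bits, width + 1))
        # Continue with the half that contains the faulty PE.
        sub_bits = (bit << width) | sub_bits
        width += 1
        size >>= 1
    return free, (sub_bits, width)


# Hypothetical example: 16-PE machine, idle subdivision {3, 7, 11, 15}
# (bits = 0b11, width = 2), faulty PE 7, minimum submachine size 2.
free, faulty = isolate_fault(0b11, 2, 4, 7, min_size=2)
print(free, faulty)
```

In this example the fault ends up in a two-PE submachine, so one fault-free PE stays idle, which is the underutilization the paragraph above describes.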

3.2. Task Migration

It  is  likely  that  some  tasks  executing  on  a  massively 
parallel  processor  will  not  use  the  entire  machine  and  will 
therefore  execute  on  an  independent  submachine  formed 
by  partitioning  the  machine.  Reasons  for  partitioning 
include  allowing  multiple  users  (spatially),  improving 
system  utilization,  and  improving  task  execution  time 
(some tasks can execute faster using fewer PEs) [28, 31].
Therefore, if one or more PE faults occur in the current
(source) submachine $P_s$, the task can be migrated to
another (destination) submachine $P_d$, if one is available.
It is assumed that $P_s$ and $P_d$ are of equal size. While this
may not be a requirement for some systems, it is a reasonable
assumption to make because $P_s$ was originally of an
appropriate size for the task. A brief analysis of the task
migration  costs  discussed  in  [26]  is  provided  below. 

The first step to be made during the migration of a
task from $P_s$ to $P_d$ is to decide to which $P_d$ to migrate.
When more than one $P_d$ is available, the $P_d$ that results in
the least cost of migration should be chosen. Two factors
that enter into this decision are the locations of $P_s$ and $P_d$
and the mapping of $P_s$ PEs to $P_d$ PEs. For an $N = 2^n$ PE
machine, an $O(n)$ time method for determining the mapping
of PEs from $P_s$ to $P_d$ that minimizes the task migration
(TM) time is provided in [27]. Thus, $T^{TM}_{Choose}$, the time
to choose $P_d$ and to determine the optimal mapping of
source PEs to destination PEs, is a function of the log of
the size of the machine and the number of destination
submachines  to  be  considered.  The  SCU  performs  this 
function,  because  it  is  responsible  for  allocating  and  deal¬ 
locating  submachines. 

The next step in the task migration process is to
transfer all necessary task information from $P_s$ to $P_d$. The
time to accomplish the data transfer depends on a combination
of the amount of data to be transmitted, the location
of source and destination submachines, the source PE
to destination PE mapping, the use of the interconnection
network by other tasks, the type of network, and system
implementation details. It is assumed that the interconnection
network is used to transfer data for the SIMD
case, and programs and data for all other cases. It is also
assumed that there is no interference during the transfer
from other tasks because it is expected that simultaneous
task migrations would rarely occur. Furthermore,
interference  from  normal  inter-PE  message  traffic  gen¬ 
erated  by  a  task(s)  executing  on  a  submachine(s)  that  the 




task  migrates  through  is  additive  in  nature  and  very  lim¬ 
ited  relative  to  the  task  migration  traffic. 

Let $T^{TM}_{Transfer}$ be the time to transfer the program and
data. For multistage cube interconnection networks and
with the exception of SIMD and general MIMD program
code, the time to perform this transfer is a linear function
of the amount of data to be transferred and the number of
network permutation settings required to accomplish
the transfer. The amount of time to transfer the SIMD
program code to $P_d$ is machine dependent. In the worst
case, the code will have to be loaded from secondary
storage into the SC of $P_d$. For the general MIMD case,
the program for the faulty PE must also be transferred
from secondary storage. As discussed earlier, this access
to secondary storage makes prediction of the transfer time
difficult.

As before, $T^{TM}_{CmpExec}$ is the time to complete the task
execution once the migration is complete. $T^{TM}_{CmpExec}$ will
be equivalent to the time required to complete the task
from the recovery point on the original submachine
before the fault occurred. Thus, $T^{TM}_{Total}$, the time to
migrate a task, is as follows.

$$T^{TM}_{Total} = T^{TM}_{Choose} + T^{TM}_{Transfer} + T^{TM}_{CmpExec}$$


The main advantages of task migration to another equal-size
submachine are that no remapping of virtual processors
to physical processors is required and once the
migration  is  completed,  the  task  will  finish  executing 
without  any  performance  degradation.  However,  task 
migration  is  not  possible  if  another  submachine  is  not 
available. Also, the overhead to move a task may be
significant compared to the task execution time remaining.
As with the fault-free subdivision option, system
degradation in the form of underutilization of fault-free
PEs may occur for the PEs in $P_s$.


3.3.  Task  Redistribution 


In many cases when a PE or PE memory fault
occurs,  it  may  be  possible  to  redistribute  the  data  and/or 
subtasks  associated  with  the  faulty  PE  to  neighboring  PEs 
within  the  same  submachine,  thus  effectively  removing 
the  faulty  PE  from  the  computation.  The  ability  to  redis¬ 
tribute  the  task  in  such  a  way  that  the  faulty  PE  is  effec¬ 
tively  removed  from  the  computation  depends  on  a 
number  of  factors  including  the  algorithm,  the  mode  of 
operation,  and  the  interconnection  network.  Consider  an 
algorithm  that  exhibits  coarse  grain  parallelism  and  is 
implemented  as  a  set  of  MIMD  subtasks  that  require  little 
inter-subtask  communication.  If  a  PE  becomes  faulty,  the 
subtask  that  was  executing  on  that  PE  can  be  distributed 
to  one  or  more  fault-free  PEs.  However,  to  minimize  the 
execution  time  of  the  task,  load  balancing  may  be 
required. Work on load balancing on parallel and distributed
systems [e.g., 21] has shown that the optimal load
balancing  solution  depends  on  a  number  of  factors  includ¬ 
ing  the  task/subtask  queuing  model  used,  the  precedence 
constraints  involved,  and  on  the  tasks  being  executed. 
Thus, it is unlikely, using available technology, that a
distribution solution can be found that works for all classes
of problems. Furthermore, there is a trade-off between
the time it takes to distribute the load and the savings in
execution time that result.

Consider an SIMD machine with a mesh interconnection
network. A faulty PE might require that all the
data on all the PEs be considered in the redistribution to
achieve optimal performance and preserve the
near-neighbor communication pattern [34]. If a multistage
cube network is used, it is possible that the network will
not be able to support the required single-pass inter-PE
communication permutations on a submachine where one
or more PEs have been effectively removed (i.e., the
permutations permitted are limited [17]). In general, there
may be additional overhead incurred whether extra time
is required to redistribute the data or extra time is needed
to perform communication in a network where a needed
permutation cannot be performed in one step due to a PE
fault.

The first step in the task redistribution (TR) process
is to determine how to accomplish the redistribution. In
general, an equal load distribution (of data for
SIMD/SPMD tasks and of both programs and data for
MIMD  tasks)  among  all  of  the  fault-free  PEs  will  result  in 
the  greatest  efficiency  for  completion  of  the  task.  The 
time to determine the redistribution, $T^{TR}_{Map}$, depends on the
mode  of  operation,  the  algorithm  mapping,  and  the 
current  state  of  each  subtask. 

It  is  assumed  that  the  SC  has  user-supplied 
knowledge  about  the  algorithm  mapping,  which  it  uses  to 
decide  how  to  redistribute  the  task.  Ideally,  this 
knowledge  would  be  compiler-supplied.  However,  while 
this  may  be  possible  for  certain  classes  of  problems,  it  is 
still  an  open  problem  in  the  general  case. 

The next step is to perform the subtask and/or data
redistribution. $T^{TR}_{Transfer}$ is the time required to accomplish
the relocation of the program code and/or data. As in the
previous two options, the time to accomplish this transfer
depends on the mode of operation, the amount of data to
be transferred, the location of source and destination PEs,
the interconnection network, and system implementation
details. Once again, for a MIMD task, the transfer of program
code from secondary storage will add some unpredictability
to the expected transfer time. The time to
complete task execution after task redistribution,
$T^{TR}_{CmpExec}$, is expected to be greater than the time to complete
the task on the fault-free submachine and is discussed
further in Sections 4 and 5. Thus, $T^{TR}_{Total}$, the time
to redistribute a task among the fault-free PEs of a submachine
and complete execution of the task, is as follows.

$$T^{TR}_{Total} = T^{TR}_{Map} + T^{TR}_{Transfer} + T^{TR}_{CmpExec}$$


An  advantage  of  the  task  redistribution  option  is 
that  all  the  fault-free  PEs  in  the  submachine  can  be  util¬ 
ized  by  the  task,  while  with  the  fault-free  subdivision 
option,  only  half  the  number  of  original  PEs  are  utilized. 
A  disadvantage  is  that  to  allow  data  and  subtasks  to  be 
redistributed  with  a  minimum  of  effort,  the  task  must  be 




coded  and/or  compiled  considering  fault  tolerance  and 
reconfiguration.  As  an  example,  linked  lists  should  be 
favored  over  indexed  arrays  when  random  access  to  the 
list  elements  is  infrequent,  because  the  code  does  not 
have  to  know  the  number  of  elements  in  the  structure  and 
does  not  have  to  be  modified  when  the  data  is  redistri¬ 
buted.  Of  the  three  options  for  reconfiguration  discussed, 
the  task  redistribution  option  is  by  far  the  most  difficult  to 
quantify.  However,  if  the  task  redistribution  option  is  to 
be  considered,  the  information  discussed  in  this  section  is 
needed  to  choose  the  best  alternative. 
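
To make the comparison of the three recovery options concrete, the following sketch sums the cost components named in this section for each option and selects the smallest total. It is only an illustration: the dictionary keys, the handling of unavailable options, and every numeric value are hypothetical, and in practice several of these terms are hard to obtain.

```python
def total_recovery_time(overheads, completion_time):
    """Sum the reconfiguration overheads and the remaining execution time."""
    return sum(overheads.values()) + completion_time


def best_option(options):
    """Return (name, total) of the cheapest available recovery option.

    options maps an option name to (overheads, completion_time), or to
    None when the option is unavailable (e.g., no free submachine).
    """
    totals = {
        name: total_recovery_time(*cost)
        for name, cost in options.items()
        if cost is not None
    }
    if not totals:
        raise ValueError("no recovery option is available")
    name = min(totals, key=totals.get)
    return name, totals[name]


# Hypothetical times in seconds, for illustration only.
options = {
    "fault-free subdivision": (
        {"choose": 0.5, "map": 1.0, "transfer": 2.0, "partition": 0.5}, 40.0),
    "task migration": ({"choose": 1.0, "transfer": 6.0}, 20.0),
    "task redistribution": ({"map": 1.5, "transfer": 1.5}, 23.0),
}
print(best_option(options))
```

With these made-up numbers, redistribution wins narrowly over migration. A shorter remaining execution time would shift the balance toward the options with the smallest overhead.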

4.  DATA-INDEPENDENT  MODEL 

One  recovery  cost  common  to  all  the  options 
presented  is  the  remaining  execution  time.  However,  the 
time  required  to  complete  execution  of  a  task  varies  with 
the  recovery  option.  For  example,  a  task  that  has  nearly 
completed  executing  may  complete  much  earlier  on  a 
subdivision  than  it  would  if  it  were  migrated  to  another 
submachine,  especially  if  a  high  overhead  migration  cost 
is  incurred.  Alternatively,  if  a  significant  amount  of  com¬ 
putation  remains,  migrating  the  task  may  result  in  an  ear¬ 
lier  task  completion  time.  Thus,  the  remaining  task  exe¬ 
cution  time  becomes  important. 

Tasks  can  be  divided  into  two  categories  based  on 
execution time. These are tasks with data-independent
execution  times  and  tasks  with  data-dependent  execution 
times.  A  task  with  a  data-independent  execution  time 
does  not  depend  on  input  data  to  make  branching  deci¬ 
sions.  Thus,  the  number  of  times  any  branch  in  the  task 
program  code  is  taken  can  be  determined  by  a  compiler 
during  program  compilation.  A  task  with  a  data- 
dependent  execution  time,  on  the  other  hand,  has  branch 
decisions  that  are  based  on  data  that  is  known  only  at  run 
time.  The  execution  time  of  such  tasks  is  considered  in 
the  next  section. 

A  model  is  presented  in  this  section  for  determining 
the  execution-time  costs  of  data-independent-execution- 
time  tasks.  This  model  differs  from  other  models  in  that  it 
can  accommodate  a  logically  shared  or  hybrid  memory 
system.  Such  an  execution-time  model  must  necessarily 
include  detail  about  the  amount  of  time  spent  making 
local and remote memory accesses. Both SIMD and
MIMD versions of the model are presented. Because
SPMD is a subset of MIMD, it is accommodated by the
MIMD  model.  Mixed-mode  operation  execution-time 
prediction  can  be  accomplished  by  extending  the  model 
to  account  for  switching  between  modes. 

4.1. SIMD Task Execution Time

In  SIMD  mode,  it  is  assumed  that  instructions  to  be 
executed  by  the  submachine  PEs  are  loaded  into  an 
automatic  broadcast  queue  by  the  SC.  This  leaves  the  SC 
free  to  perform  computations  of  its  own  (e.g.,  the  manipu¬ 
lation  of  loop  index  variables,  PE-common  array  index 
calculations). Thus, the execution time for an SIMD task
includes the SC execution time plus the PE execution
time minus any overlap between the two. For a
comprehensive model for determining the amount of
SC/PE overlap in an SIMD task see [16]. Let $T_{SC}$ be the
time the SC spends performing its calculations, $T_{OL}$ be
the total SC/PE overlap, and $T_{PE}^{(i,P)}$ be the execution time of
instruction i on processor P. For a disabled PE, $T_{PE}^{(i,P)} = 0$.
In SIMD mode, a new instruction is broadcast to the PEs
by the SC only after the current instruction's execution
has been completed (or initiated, if prefetching is used).
Therefore, $T_{SIMD}$, the overall time for a SIMD task consisting
of the execution of I instructions on Q PEs,
$0 \le P \le Q-1$, is

$$T_{SIMD} = T_{SC} - T_{OL} + \sum_{i=0}^{I-1} \max_{P} \left[ T_{PE}^{(i,P)} \right].$$

Let $t_L$ be the time required for a PE to make a single
access to its local memory. It is assumed that any such
access takes a constant amount of time. Similarly, let $t_R$
be the average time required to make an access to a
remote (non-local) memory location.

For  the  architectures  under  consideration,  a  remote 
memory  reference  requires  use  of  the  interconnection  net¬ 
work.  For  a  read  of  remote  memory,  an  address  must 
traverse  the  network  to  the  remote  memory  and  the 
memory  contents  must  traverse  the  network  to  the  source 
of  the  read  request.  For  a  remote  memory  write,  an 
address and word to be written must be sent to the remote
memory  location.  Generally,  an  acknowledgment  of 
some  type  is  returned  by  the  remote  network  interface. 
The  time  required  for  an  address  or  data  word  to  traverse 
the  network  can  be  divided  into  two  components.  These 
are  transfer  time  and  overhead.  The  transfer  time  is  the 
actual  propagation  time  required  assuming  no  contention 
in  the  network.  Transfer  time  is  a  constant  for  networks 
that have equidistant paths between any source and destination
pair (e.g., multistage cube network), but may vary
for other networks (e.g., hypercube). The overhead is
time  spent  to  establish  a  path  through  the  network  and  the 
time  spent  waiting  due  to  network  contention  or  conten¬ 
tion  at  the  remote  memory.  In  general,  the  determination 
of  overhead  is  a  difficult  problem  that  depends  on  the  net¬ 
work  implementation  details  and  the  message  traffic. 
Thus, $t_R$ is only an estimated average value; calculating
an  exact  time  for  each  transfer  would  require  machine 
and  application-dependent  analysis.  Research  on  detailed 
modeling  of  networks  has  been  the  topic  of  entire  papers 
[e.g., 11], and is beyond the scope of this work.

Here,  for  the  sake  of  simplicity,  remote  memory 
reads  and  writes  are  assumed  to  take  the  same  amount  of 
time.  The  model  could  be  extended  to  include  different 
costs for the two types of references.

The number of local and remote memory references
made by PE P during instruction i (when no PE is faulty)
is denoted by $M_L^{(i,P)}$ and $M_R^{(i,P)}$, respectively. A compiler
could determine these values for every instruction in a
task as well as the amount of time spent by PE P on
instruction i doing non-memory reference work (“computation
time”), $T_C^{(i,P)}$. For a disabled PE, $M_L^{(i,P)} = 0$,
$M_R^{(i,P)} = 0$, and $T_C^{(i,P)} = 0$. Thus, the total time spent by PE P
on instruction i is given by

$$\left( t_L M_L^{(i,P)} + t_R M_R^{(i,P)} + T_C^{(i,P)} \right).$$

Therefore, the new SIMD execution-time model
becomes

$$T_{SIMD} = T_{SC} - T_{OL} + \sum_{i=0}^{I-1} \max_{P} \left[ t_L M_L^{(i,P)} + t_R M_R^{(i,P)} + T_C^{(i,P)} \right].$$
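
The SIMD model above (a "sum of max's" over instructions) can be evaluated with a short Python sketch. The function name, the nested-list layout of the per-instruction counts, and the sample numbers are assumptions chosen for illustration; a compiler would supply the real counts.

```python
def simd_time(t_sc, t_ol, t_local, t_remote, refs):
    """Evaluate the SIMD execution-time model.

    refs[i][p] is a tuple (local_refs, remote_refs, compute_time) for
    instruction i on PE p; a disabled PE contributes (0, 0, 0).
    """
    total = t_sc - t_ol
    for per_pe in refs:
        # The next instruction is broadcast only after the slowest PE
        # finishes the current one, so each instruction costs the maximum.
        total += max(
            t_local * m_loc + t_remote * m_rem + t_comp
            for m_loc, m_rem, t_comp in per_pe
        )
    return total


# Hypothetical example: two instructions on two PEs, times in cycles.
refs = [
    [(2, 0, 4), (1, 1, 4)],   # instruction 0 on PE 0 and PE 1
    [(0, 0, 0), (3, 0, 2)],   # instruction 1; PE 0 is disabled
]
print(simd_time(t_sc=5, t_ol=3, t_local=1, t_remote=10, refs=refs))
```

Here the single remote reference made by PE 1 dominates instruction 0, which shows why the model needs separate local and remote access times.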

The  model  for  determining  SIMD  task  execution 
time  presented  above  can  be  applied  in  cases  where  the 
exact  sequence  of  instructions  to  be  executed  is  known  at 
compile time. However, for most programs the sequence
of instructions executed is data dependent. For such
programs it is not possible to determine an execution time
during compilation. Furthermore, it is not practical for
the  SC  to  determine  the  amount  of  execution  time  left 
when  a  fault  occurs,  because  it  would  potentially  have  to 
simulate  the  execution  of  every  instruction  on  every  PE 
to  determine  when  the  task  has  completed.  In  cases 
where  empirical  data  is  available  on  the  execution  time 
of  a  particular  task,  it  is  possible  to  determine  an 
estimate of the execution time remaining. Such an estimate
is discussed in the next section. For
data-dependent-execution-time tasks where there is little
empirical data, there is no practical way of determining
an execution time. In such cases, the reconfiguration
decision may have to be made without benefit of
execution-time knowledge.

4.2. MIMD Task Execution Time


It is assumed in the following model that an instruction
cycle is composed of a fetch phase and an execute
phase. Because each PE may execute a different program
in MIMD mode, let I(P) denote the number of instructions
that PE P must execute to complete a task and let $T_{PE}^{(i,P)}$ be
the time for PE P to execute its i-th instruction. Then,
$T_{MIMD}$, the execution time for a task in MIMD mode, is

$$T_{MIMD} = \max_{P} \left[ \sum_{i=0}^{I(P)-1} T_{PE}^{(i,P)} \right].$$

This “max of sums” for MIMD versus “sum of max's”
for SIMD has been discussed in another context in [3].
Let $W^{(i,P)}$ represent the number of words that PE P
must fetch from local memory to read in instruction i. As
before, the number of local and remote memory references
made by PE P during its i-th instruction is denoted by
$M_L^{(i,P)}$ and $M_R^{(i,P)}$, respectively. A compiler could determine
these values for every instruction in a task as well as $T_C^{(i,P)}$,
the amount of time spent by PE P on its i-th instruction
doing non-memory-reference work. Thus, the MIMD
task execution time becomes

$$T_{MIMD} = \max_{P} \left[ \sum_{i=0}^{I(P)-1} \left( t_L \left( W^{(i,P)} + M_L^{(i,P)} \right) + t_R M_R^{(i,P)} + T_C^{(i,P)} \right) \right].$$

As with the SIMD execution-time model, the equation
above is useful only when the exact sequence of
instructions for every MIMD subtask is known. The
MIMD execution-time equation given does not account
for conflicts in the interconnection network, so the
execution time derived from the equation will be a minimum
execution time.
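
The MIMD model (a "max of sums" over PEs) can be sketched in the same style. The function name and data layout are hypothetical, and the sketch assumes that each instruction word fetched from local memory costs one local access time, which is one reading of the fetch-phase description above. Like the equation, it ignores network conflicts and therefore yields a minimum time.

```python
def mimd_time(t_local, t_remote, programs):
    """Evaluate the MIMD execution-time model.

    programs[p] lists, for each instruction PE p executes, a tuple
    (fetch_words, local_refs, remote_refs, compute_time).
    """
    per_pe_totals = [
        sum(
            t_local * (words + m_loc) + t_remote * m_rem + t_comp
            for words, m_loc, m_rem, t_comp in program
        )
        for program in programs
    ]
    # PEs run independently, so the task ends when the slowest PE ends.
    return max(per_pe_totals, default=0)


# Hypothetical example: PE 0 runs two instructions, PE 1 runs one.
programs = [
    [(1, 2, 0, 4), (1, 0, 1, 3)],
    [(2, 1, 0, 5)],
]
print(mimd_time(t_local=1, t_remote=10, programs=programs))
```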

5.  DATA-DEPENDENT  MODEL 


Most  tasks  have  a  data-dependent  execution  time, 
making  the  execution-time  model  presented  in  the 
preceding  section  inapplicable.  However,  in  a  production 
environment where a task is executed repeatedly on various
sets of data (e.g., image processing of satellite pictures),
empirical studies can be performed to derive an
estimated execution time for a task. For these cases it is
possible to develop a model that will predict the expected
execution time remaining for a task that is recovering
from a fault.


5.1. Execution Time on a Different Submachine


Once  a  task  has  been  migrated  from  a  submachine 
containing  a  faulty  PE  to  a  fault-free  submachine  of  equal 
size,  the  time  to  complete  the  task  execution  will  be  the 
same  as  it  would  have  been  on  the  original  fault-free  sub¬ 
machine.  An  estimate  of  this  expected  completion  time  is 
now  derived. 

Let $T_{Exec}$ be a discrete random variable that
represents the execution time for a specific task. $T_{Exec}$ is
always positive because a negative execution time is not
possible. $f(T_{Exec})$ is the discrete density function [23] of
$T_{Exec}$. If $T_{Exec}$ takes the value $x_i$ with probability $p_i$, and
$\delta(x)$ is the impulse function such that $\delta(x - x_i)$ has value
one at $x = x_i$ and value zero elsewhere, then

$$f(T_{Exec}) = \sum_{i} p_i \, \delta(T_{Exec} - x_i).$$

By definition, the expected value of $T_{Exec}$ is

$$E[T_{Exec}] = \int T_{Exec} \, f(T_{Exec}) \, dT_{Exec} = \bar{T}_{Exec}.$$


It is assumed that the amount of execution time
spent on a task prior to a checkpoint is stored with that
checkpoint. If a recovering task is to proceed from a
checkpoint and the execution time stored with that checkpoint
is t, $T_{CmpExec}$, the expected amount of time required
to complete task execution, is

$$T_{CmpExec} = \bar{T}_{Exec} - t.$$

$T_{CmpExec}$ assumes that the submachine size remains the
same and that all the PEs in the submachine are fault-free.
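
A small sketch of this estimate, assuming the empirical data are simply a list of observed execution times for earlier runs of the same task. The function names and sample values are hypothetical.

```python
from collections import Counter


def discrete_density(samples):
    """Map each observed execution time to its relative frequency."""
    if not samples:
        raise ValueError("at least one observation is required")
    counts = Counter(samples)
    return {x: c / len(samples) for x, c in counts.items()}


def expected_time(density):
    """Expected value of a discrete density: sum of x_i * p_i."""
    return sum(x * p for x, p in density.items())


def remaining_time(density, t_checkpoint):
    """Expected execution time minus the time stored with the checkpoint."""
    return expected_time(density) - t_checkpoint


# Hypothetical observed execution times (seconds) from four earlier runs.
density = discrete_density([100, 120, 120, 160])
print(expected_time(density), remaining_time(density, 45))
```

The estimate can come out negative when a run has already exceeded the expected time, so a practical implementation would need some policy for that case; none is specified in the text.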


5.2. Execution Time on a Fault-Free Submachine

Here, the completion time of a task that completes
execution on a subdivision that is half the size of the
original submachine is considered. Let $T_{2x}$ be the expected
execution time of a task on a submachine consisting of 2x
PEs, and let $T_x$ be the expected execution time of the
same task remapped onto one of the two equal-sized
subdivisions of the original submachine. If the average time
for an inter-PE transfer remains the same in either case,

$$T_{2x} \ge \frac{T_x}{2}.$$

The  inequality  becomes  an  equality  when  the  mapping  of 
the  task  from  the  subdivision  onto  the  2x  PE  submachine 
is  optimal. 

This equation can be extended to submachines of
arbitrary size as follows. Let $S_{Old}$ be the number of PEs
in the old (original) submachine and let $S_{New}$ be the
number of PEs in the new submachine (subdivision).
Then, the expected remaining execution time on the
subdivision is bounded by

$$T_{CmpExec} \le \frac{S_{Old}}{S_{New}} \left( \bar{T}_{Exec} - t \right).$$

A  more  accurate  remaining  execution  time  estimate 
can  be  obtained  if  empirical  data  is  available  to  deter¬ 
mine  expected  task  execution  times  for  the  submachine 
sizes of interest. Then, the estimated task execution time
becomes a function of the submachine size and the estimate
of the remaining execution time becomes

$$T_{CmpExec} = \bar{T}_{Exec}(S_{New}) - t.$$

An execution-time estimate for the task redistribution
recovery option is more difficult to obtain than for the previous
options. Consider a task executing on a submachine of
size $S_{Old}$ in MIMD mode. If a PE becomes faulty and its
subtasks are distributed equally to the $S_{New} = S_{Old} - 1$
fault-free PEs in the submachine, the remaining execution
time is bounded as follows.

$$T_{CmpExec} \le \frac{S_{Old}}{S_{New}} \left( \bar{T}_{Exec} - t \right).$$

Consider  the  case  where  the  faulty  PE's  subtasks  cannot 
be  distributed  equally  among  the  fault-free  PEs.  In  the 
worst  case,  all  the  faulty  PE's  subtasks  would  be 
assigned  to  one  PE.  In  this  case,  the  remaining  execution 
time  could  be  twice  that  of  the  remaining  execution  time 
on  a  fault-free  submachine.  The  degree  to  which  the 
performance  of  the  system  on  a  task  is  degraded  is  a 
function of the system ($\alpha$), the algorithm implementation
($\beta$), and the number and location of faulty PEs ($\gamma$). Let
$d(\alpha,\beta,\gamma)$ be defined as the amount of performance
degradation: $1 < d(\alpha,\beta,\gamma) \le 2$. The amount of performance
degradation can be estimated from empirical data or from
user-provided information about the algorithm implementation.
The remaining execution time then becomes

$$T_{CmpExec} \le d(\alpha,\beta,\gamma) \left( \bar{T}_{Exec} - t \right).$$

Using the execution-time models presented in this
section, one can arrive at an estimate of the remaining
execution time for a task after recovery from a fault.
The remaining execution time for tasks with
data-dependent execution times for which no empirical data is
available  cannot  be  estimated  reliably.  In  these  cases,  the 
user  must  be  required  to  supply  an  estimate  prior  to  exe¬ 
cution  time  if  a  comparison  of  recovery  options  is  to  be 
performed. 
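
As a minimal sketch of the degradation-based estimate discussed above, the following Python function scales a fault-free remaining time by the degradation factor d, which the text restricts to the range from 1 to 2. The function name, the parameter names, and the assumption that the fault-free remaining time is simply the expected total time minus the elapsed time are illustrative choices, not notation taken from the paper.

```python
def remaining_time_bound(total_time, elapsed, degradation):
    """Upper bound on remaining execution time after task redistribution.

    total_time  -- assumed expected fault-free execution time of the task
    elapsed     -- assumed time the task had executed when the fault occurred
    degradation -- the factor d, required to satisfy 1 <= d <= 2
    """
    if not 1.0 <= degradation <= 2.0:
        raise ValueError("degradation factor must lie between 1 and 2")
    if elapsed < 0 or elapsed > total_time:
        raise ValueError("elapsed time must lie between 0 and total_time")
    # Fault-free remaining time, scaled by the performance degradation.
    return degradation * (total_time - elapsed)
```

With d = 2 the function gives the worst case described in the text, in which all subtasks of the faulty PE are assigned to a single fault-free PE.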

An alternative method for determining remaining
execution time centers around the use of an automatic
complexity evaluator such as that presented in [18]. The
approach  here  is  to  attempt  to  generate  a  nonrecursive 
function,  prior  to  run  time,  that  can  be  solved  at  run  time 
to  determine  the  asymptotic  time-complexity  behavior  of 
a  program.  This  approach  is  applicable  to  a  wide  range 
of  programs,  but  it  is  not  always  successful  in  generating 
a  nonrecursive  function.  Further  study  is  needed  to  see 
how  it  can  be  applied  here. 

6.  CONCLUSIONS 

The  viability  of  a  quantitative  model  of  system 
reconfiguration  due  to  a  PE  or  PE  memory  module  fault 
was  examined.  Three  fault  recovery  options  were  dis¬ 
cussed  and  the  parameters  required  to  determine  their 
respective costs were identified. It has been shown that,
given the present technology, collecting precise values
for some of these parameters is very difficult. As an
example of this, two models for determining the remaining
execution time of a task after reconfiguration to
tolerate a fault were presented. The first is applicable to
tasks with data-independent execution times and relies on
compile-time analysis of the task. The second
execution-time model is for use with tasks with data-dependent
execution times that are production oriented.
This model uses empirically derived or user-supplied
expected execution times to estimate the remaining execution
time for tasks.

The  model  presented  in  the  preceding  sections  can 
be  used  to  derive  some  heuristics  to  aid  in  the  decision  of 
which  option  to  choose.  Consider  a  SIMD  task  for  which 
every  PE  in  a  subdivision  of  the  submachine  checkpoints 
to  a  PE  in  the  other  subdivision  in  such  a  way  that  no  two 
PEs  checkpoint  to  the  same  PE.  Then,  either  subdivision 
will  contain  all  the  data  needed  to  recover  from  a  PE  fault 
and  no  data  will  have  to  be  moved  during  the  recovery  if 
the  subdivision  option  is  used  (although  a  better  mapping 
may  exist).  Let  Dq*pb  be  the  amount  of  data  check- 
pointed  per  PE  during  the  last  checkpoint  operation  and 
let  be  the  expected  time  10  transfer  one  data  item  across 
the  interconnection  netware.  If  Ecsm*  *  t*  <  (tu  -t). 
the task migration option should not be considered
because it would take longer than completing the task on
a subdivision.

It is possible that the overhead required to determine
the best fault-recovery option may be greater than the
possible  benefit  of  selecting  the  best  option.  Let  To*** 
be the expected time to decide on the best option. For an
SIMD task, if the subdivision




1992  International  Conference  on  Parallel  Processing 


option  can  be  used  to  complete  the  task  before  a  better 
option  (if  one  exists)  can  be  selected.  The  subdivision 
option  is  chosen  because  there  is  no  possibility  of 
interference  with  tasks  on  other  submachines  and  no  algo¬ 
rithm  dependent  redistribution  of  data  (and  associated 
overhead)  is  needed. 

For  MIMD  and  SPMD  tasks,  if  all  the  subtasks  of 
the  faulty  PE  are  redistributed  to  one  fault-free  PE,  the 
task  will  complete  in  approximately  the  same  amount  of 
time as it would on a subdivision. When the subtasks of
the faulty PE are redistributed to more than one fault-free
PE, the redistribution option will be faster than the subdivision
option. Thus, as a general heuristic, there is no
need to consider the subdivision option for MIMD and
SPMD tasks.

For  production  environment  tasks  it  may  be  possible 
to  collect  empirical  data  on  expected  completion  time  for 
a  task  as  a  function  of  the  time  the  task  spent  executing 
before  the  fault  and  the  recovery  method  used  (the  fault 
location  may  also  have  to  be  considered).  Then,  the 
choice  of  recovery  options  becomes  simply  a  matter  of 
choosing the option with the smallest completion time
given  the  point  at  which  the  fault  occurred.  Of  course,  for 
the  task  migration  option  it  may  be  impossible  to  predict 
the  amount  of  interference  possible  from  tasks  executing 
on  other  submachines. 
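
The selection rule described in this paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions: the option names and the dictionary layout are hypothetical, and an estimate of `None` stands for an option whose completion time cannot be predicted (for example, task migration under unknown interference from other submachines).

```python
def choose_recovery_option(estimates):
    """Return the recovery option with the smallest estimated completion time.

    estimates -- dict mapping an option name to its estimated completion
                 time, or to None when no estimate is available.
    Returns None when no option has a usable estimate.
    """
    # Options without an estimate cannot be compared and are skipped.
    known = {name: t for name, t in estimates.items() if t is not None}
    if not known:
        return None
    return min(known, key=known.get)


# Hypothetical empirical estimates for a fault at one point in the task.
example = {"subdivision": 120.0, "redistribution": 75.0, "migration": None}
best = choose_recovery_option(example)
```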

In  conclusion,  using  the  model  presented  here,  it 
may  be  possible  to  choose  an  “optimal”  reconfiguration 
strategy  for  some  classes  of  tasks  executing  on  a  given 
machine.  However,  much  work  remains  before  a  quanti¬ 
tative  model  of  system  reconfiguration  will  be  practical 
for  all  classes  of  tasks.  Analysis  of  such  models  helps 
designers  of  fault  tolerant  systems  determine  require¬ 
ments  for  compilers  and  operating  systems  to  provide 
needed  input  parameters  for  the  model.  It  also  serves  to 
identify  areas  where  new  technology  is  needed. 

Plans  for  future  work  include  the  validation  of  the 
model  presented  here  by  experimentation  and/or  simula¬ 
tion. 

Acknowledgments:  The  authors  gratefully  acknowledge 
useful  discussions  with  James  Armstrong,  Hasan  Cam, 
and Peter Wahle. The authors also wish to thank the
anonymous referees for their helpful suggestions.

REFERENCES 

[1] V. Balasubramanian and P. Banerjee, “CRAFT:
Compiler-assisted algorithm-based fault tolerance in
distributed memory multiprocessors,” 1991 Int'l
Conf. on Parallel Processing, v. I, Aug. 1991, pp.
505-508.

[2] K. E. Batcher, “Bit-serial parallel processing systems,”
IEEE Trans. Comput., v. C-31, May 1982,
pp. 377-384.

[3] T. B. Berg, S. D. Kim, and H. J. Siegel, “Limitations
imposed on mixed-mode performance of optimized
phases due to temporal juxtaposition,” J. Parallel
Distrib. Comput., Special Issue on Massively Parallel
Computation, v. 13, Oct. 1991, pp. 154-169.

[4]  T.  Blank,  “The  MasPar  MP-1  architecture,"  IEEE 
Compcon,  Feb.  1990,  pp.  20-24. 

[5] W. Crowther, J. Goodhue, R. Thomas, W. Milliken,
and T. Blackadar, “Performance measurements on a
128-node butterfly parallel processor,” 1985 Int'l
Conf. on Parallel Processing, Aug. 1985, pp. 531-
540.

[6]  A.  T.  Dahbura,  G.  M.  Masson,  and  C.  Yang,  “Self- 
implicating  structures  for  diagnosable  systems.” 
IEEE  Trans.  Comput.,  v.  C-33,  Aug.  1985,  pp.  718- 
723. 

[7] F. Darema-Rodgers, D. A. George, V. A. Norton,
and G. F. Pfister, Environment and System Interface
for VM/EPEX, Research Report RC11381 (#51260),
IBM T. J. Watson Research Center, 1985.

[8] S. A. Fineberg, T. L. Casavant, and H. J. Siegel,
“Experimental analysis of a mixed-mode parallel
architecture using bitonic sequence sorting,” J.
Parallel Distrib. Comput., v. 11, Mar. 1991, pp.
239-251.

[9]  M.  J.  Flynn,  “Very  high-speed  computing  sys¬ 
tems,”  Proc.  IEEE,  v.  54,  Dec.  1966,  pp.  1901- 
1909. 

[10] T. M. Frazier and Y. Tamir, “Application-transparent
error-recovery techniques for multicomputers,”
Fourth Conf. on Hypercubes, Concur.
Comput., & Appl., Mar. 1989, pp. 103-108.

[11] P. G. Harrison, “Analytic Models for Multistage
Interconnection Networks,” J. Parallel Distrib.
Comput., Special Issue on Modeling of Parallel
Computers, v. 12, Aug. 1991, pp. 357-369.

[12]  J.  Hastad,  T.  Leighton,  and  M.  Newman, 
“Reconfiguring  a  hypercube  in  the  presence  of 
faults,”  19th  ACM  Symp.  Theory  Comput.,  1987, 
pp.  274-284. 

[13] J. P. Hayes, T. N. Mudge, Q. F. Stout, and S. Colley,
“Architecture of a hypercube supercomputer,”
1986 Int'l Conf. on Parallel Processing, Aug. 1986,
pp. 653-660.

[14] D. J. Hunt, “AMT DAP - a processor array in a
workstation environment,” Comput. Sys. Sci.
Engr., Vol. 4, No. 2, April 1989, pp. 107-114.

[15] Intel Corporation, A New Direction in Scientific
Computing, Order # 28009-001, Intel Corp., 1985.

[16] S. D. Kim, M. A. Nichols, and H. J. Siegel, “Modeling
overlapped operation between the control unit
and processing elements in an SIMD machine,”
J. Parallel Distrib. Comput., Special Issue on Modeling
of Parallel Computers, v. 12, Aug. 1991, pp.
329-342.




[17] D. H. Lawrie, “Access and alignment of data in an
array processor,” IEEE Trans. Comput., v. C-24,
Dec. 1975, pp. 1145-1155.

[18] D. Le Metayer, “ACE: An Automatic Complexity
Evaluator,” ACM Trans. Prog. Lang. & Sys., v. 10,
Apr. 1988, pp. 248-266.

[19] G. J. Lipovski and M. Malek, Parallel Computing:
Theory and Comparisons, Wiley, New York, 1987.

[20]  M.  Livingston  and  Q.  F.  Stout,  “Fault  tolerance  of 
allocation  schemes  in  massively  parallel  comput¬ 
ers,"  Frontiers  ‘88:  The  2nd  Symp.  on  the  Frontiers 
of  Massively  Parallel  Comput.,  Oct.  1988,  pp.  491- 
494. 

[21] W. G. Nation, A. A. Maciejewski, and H. J. Siegel,
“Exploiting concurrency among tasks in partitionable
parallel supercomputers,” 1992 Int'l Parallel
Processing Symp., Mar. 1992, pp. 30-38.

[22] G. J. Nutt, “Microprocessor implementation of a
parallel processor,” Fourth Ann. Symp. on Comput.
Architecture, Mar. 1977, pp. 147-152.

[23] A. Papoulis, Probability, Random Variables, and
Stochastic Processes, Second Edition, McGraw-Hill,
New York, NY, 1984, p. 71.

[24] G. F. Pfister, W. C. Brantley, D. A. George, S. L.
Harvey, W. J. Kleinfelder, K. P. McAuliffe, E. A.
Melton, V. A. Norton, and J. Weiss, “The IBM
Research Parallel Processor Prototype (RP3): introduction
and architecture,” 1985 Int'l Conf. on
Parallel Processing, Aug. 1985, pp. 764-771.

[25] G. Saghi, H. J. Siegel, and J. A. B. Fortes, On a
Quantitative Model of System Reconfiguration Due
to a Fault, Tech. Rep. in preparation, EE School,
Purdue.

[26] T. Schwederski, H. J. Siegel, and T. L. Casavant, “A
model of task migration in partitionable parallel processing
systems,” Frontiers '88: The 2nd Symp. on
the Frontiers of Massively Parallel Comput., Oct.
1988, pp. 211-214.

[27] T. Schwederski, H. J. Siegel, and T. L. Casavant,
“Optimizing task migration transfers using multistage
cube network,” 1990 Int'l Conf. on Parallel
Processing, v. I, Aug. 1990, pp. 51-58.

[28] H. J. Siegel, “The theory underlying the partitioning
of permutation networks,” IEEE Trans. Comput., v.
C-29, Sep. 1980, pp. 791-801.

[29] H. J. Siegel, T. Schwederski, J. T. Kuehn, and N. J.
Davis IV, “An overview of the PASM parallel processing
system,” in Computer Architecture, D. D.
Gajski, V. M. Milutinovic, H. J. Siegel, and B. P.
Furht, eds., IEEE Computer Society Press, Washington,
DC, 1987, pp. 387-407.

[30] H. J. Siegel, Interconnection Networks for Large-Scale
Parallel Processing: Theory and Case Studies,
Second Edition, McGraw-Hill, New York, NY,
1990.

[31] H. J. Siegel, J. B. Armstrong, and D. W. Watson,
“Mapping computer-vision-related tasks onto
reconfigurable parallel-processing systems,” Computer,
Special Issue on Parallel Processing for Computer
Vision and Image Understanding, v. 25, Feb.
1992, pp. 54-63.

[32]  L.  W.  Tucker  and  G.  G.  Robertson,  “Architecture 
and  applications  of  the  Connection  Machine,”  Com¬ 
puter,  v.  21,  Aug.  1988,  pp.  26-38. 

[33] M. U. Uyar and A. P. Reeves, “Fault
reconfiguration in a distributed MIMD environment
with a multistage network,” 1985 Int'l Conf. on
Parallel Processing, Aug. 1985, pp. 798-806.

[34]  M.  U.  Uyar  and  A.  P.  Reeves,  “Dynamic  fault 
reconfiguration  in  a  mesh-connected  MIMD 
environment,”  IEEE  Trans.  Comput.,  v.  C-37,  Oct. 
1988,  pp.  1191-1205. 




REFERENCE  NO.  19 


Cam,  H.  and  Fortes,  J.  A.  B.,  "Frames:  A  Simple  Characterization  of  Permutations 
realized by Frequently Used Networks," Tech. Rpt. TR-EE-92-32, School of Electrical
Engineering, Purdue University, July 1992, 44 pages.

Note  -  This  report  proposes  a  simple  characterization  of  permutation  capabilities  of 
interconnection  networks.  The  proposed  approach  makes  it  computationally  easy  to 
detect  whether  a  particular  interconnection  pattern  (i.e.,  configuration)  can  be  imple¬ 
mented  by  a  network.  Submitted  for  publication  in  IEEE  Transactions  on  Computers. 


Frames:  A  Simple 
Characterization  of 
Permutations  Realized  by 
Frequently  Used  Networks 


Hasan  Cam 
José A. B. Fortes




TR-EE  92-32 
July  1992 


School  of  Electrical  Engineering 

Purdue  University 

West Lafayette, Indiana 47907-1285


FRAMES:  A  SIMPLE  CHARACTERIZATION  OF  PERMUTATIONS 
REALIZED  BY  FREQUENTLY  USED  NETWORKS 


Hasan Cam and José A. B. Fortes


School  of  Electrical  Engineering 
Purdue  University 
West  Lafayette,  IN  47907-1285 


ABSTRACT 

Rearrangeable  multistage  interconnection  networks  such  as  the  Benes  network 
realize  any  permutation,  yet  their  routing  algorithms  are  not  cost-effective.  On  the 
other  hand,  non-rearrangeable  networks  can  have  inexpensive  routing  algorithms,  but 
no  simple  technique  exists  to  characterize  all  the  permutations  realized  on  these  net¬ 
works.  This  paper  introduces  the  concept  of  frame  and  shows  how  it  can  be  used  to 
characterize  all  the  permutations  realized  on  various  multistage  interconnection  net¬ 
works.  They  include  any  subnetwork  of  the  Benes  network,  the  class  of  networks  that 
are  topologically  equivalent  to  the  baseline  network,  and  cascaded  baseline  and 
shuffle-exchange  networks.  The  question  of  how  the  addition  of  a  stage  to  any  of  these 
networks  affects  the  type  of  permutations  realized  by  the  network  is  precisely 
answered.  Also,  of  interest  from  a  theoretical  standpoint,  a  new  simple  proof  is  pro¬ 
vided  for  the  rearrangeability  of  the  Benes  network. 

Index  Terms —  Multistage  interconnection  network,  permutations,  rearrangeabil¬ 
ity,  topological  equivalence,  balanced  matrices,  frames. 


This  research  was  supported  in  part  by  the  Office  of  Naval  Research  under  contract  No.  00014- 
90-J-1483  and  in  part  by  the  Innovative  Science  and  Technology  Office  of  the  Strategic  Defense 
Initiative  Organization  and  was  administered  through  the  Office  of  Naval  Research  under  contract 
No.  00014-88-k-0723. 


List  of  Symbols 
IN: interconnection network.

IP: interconnection pattern.

$IP_{in}$: interconnection pattern formed by input links.

$IP_{out}$: interconnection pattern formed by output links.

SB: switching box; switch.

BE: baseline network; see Definition II.3.

RB: reverse baseline network; see Definition II.3.

SE: shuffle-exchange network; see Definition II.3.

$SE^{-1}$: inverse shuffle-exchange network; see Definition II.3.

BS: Benes network; see Definition II.3.

CS: Clos network; see Definition II.3.

$N$: number of inputs/outputs of a network.

$n$: $\log_2 N$.

P  the  standard  a- type  frame  with  k  columns;  see  Definition  HL2. 

P?*:  an  a-type  frame  with  k  columns;  see  Definition  m3. 

F'.j,:  the  universal  frame  with  k  columns;  see  Definitkn  m.6. 

$I$: the identity permutation matrix; see Definition II.1.

$R$: the reverse permutation matrix; see Definition II.1.

$r$: the reverse permutation represented by $R$.

$A_{N\times k}$: matrix $A$ with $N$ rows and $k$ columns.

$A_{N\times k}(i)$: the $i$th row of matrix $A$.

y  a  permutation  on  the  set  [0.1....,  AM);  see  Definition  HL2. 

0:  a  mapping  of  the  set  { 1.2. . . .  ,k)  into  { 1,2, . . .  ,/s);  see  Definition  3L2. 
P'.  a  tuple  of  partitions;  see  Definition  3L2. 




I. INTRODUCTION

Interconnection  networks  are  utilized  to  provide  communication  among  process¬ 
ing  elements  and/or  memory  modules.  Network  performance  significantly  affects  the 
overall  cost  and  performance  of  a  computational  system  because  processors  may  spend 
a  considerable  amount  of  time  in  processor-processor  and/or  processor-memory  com¬ 
munication.  Therefore,  it  is  important  to  know  exactly  the  interconnection  patterns  that 
can  be  implemented  by  a  network.  In  particular,  it  is  desirable  to  know  what  permuta¬ 
tions  can  be  realized  because  parallel  algorithms  often  require  permutation-type  data 
transfers. This paper presents a simple and easily understandable characterization of
the permutations realized by any network with $N = 2^n$ inputs that is topologically
equivalent to one of the following networks: the first $k$ stages, $1 \le k \le n$, of the reverse
baseline network, the last $n+k-1$ stages of the Benes network [7], or a cascade of baseline
[11] and $k$-stage shuffle-exchange [14] networks. The proposed characterizations are
based on the notion of “frame” (introduced in this paper), balanced matrices [2] and
graph theory [3,4].

The  effectiveness  of  any  interconnection  network  depends  on  factors  such  as  the 
efficiency  of  the  routing  algorithm,  the  number  and  type  of  permutations  it  realizes, 
and the actual hardware implementation of the network. On one hand, rearrangeable
multistage interconnection networks such as Benes and $\Omega\Omega^{-1}$ (the $\Omega\Omega^{-1}$ network is a cascade
of omega and inverse omega [1]) can realize any permutation. However, there are no
known  efficient  routing  algorithms  to  allow  dynamic  configuration  in  an  environment 
where  the  switching  permutations  change  rapidly.  On  the  other  hand,  some  networks 
such  as  baseline  and  omega  have  efficient  routing  algorithms  and  small  propagation 
delays,  but  cannot  realize  many  permutations.  In  these  cases,  it  is  important  to  know 
which  permutations  are  realizable  and  this  is  possible  by  using  the  results  of  this  paper. 

Different  approaches  have  been  proposed  in  the  literature  to  circumvent 
inefficient  routing  algorithms.  One  approach  is  to  determine  certain  types  of  permuta¬ 
tions  that  occur  more  frequently  than  others  in  a  parallel  processing  environment.  Such 
permutations  have  been  classified  by  Lenfant  [23]  into  five  families.  In  order  to  imple¬ 
ment  these  permutations  on  the  Benes  network  with  a  small  propagation  delay,  Lenfant 
proposed  a  specialized  routing  algorithm  for  each  family.  A  permutation  that  fails  to 
be  in  one  of  these  families  still  is  routed  using  an  inefficient  routing  algorithm.  To 
increase  the  number  of  the  families  of  permutations  that  can  be  realized  by  a  network, 
Youssef  and  Arden  [22]  introduced  an  OClog2#)  routing  algorithm  which  sets  the 
($r \times r$) crossbar switches of the first stage of 3-stage Benes networks with $N = r^2$ inputs
to  a  fixed  configuration  and  acts  exactly  like  a  self-routing  algorithm  in  setting  the 
remaining switches. Another approach is to provide self-routing algorithms for realizing
some classes of permutations in various multistage interconnection networks such
as the Benes and $2n$-stage shuffle-exchange networks. Nassimi and Sahni [24] presented simple
self-routing algorithms to realize some important permutations in Benes networks.
Raghavendra and Boppana [25] proposed self-routing algorithms to realize the class of
linear permutations on Benes and $2n$-stage shuffle-exchange networks.

Although  a  large  number  of  multistage  interconnection  networks  are  extensively 
studied,  there  is  a  relatively  small  number  of  basic  designs  for  their  underlying  topolo¬ 
gies.  Especially,  Benes  networks  and  six  topologically  equivalent  networks,  namely, 
omega, flip, indirect binary cube, modified data manipulator, baseline, and reverse
baseline  have  been  investigated  in  depth  and  are  frequently  used  in  research  studies 
and  real  systems.  Characterizations  of  the  topologies  of  these  networks  are  given  in 
[9,26,27].  However,  to  our  knowledge,  the  characterization  of  the  permutations  per¬ 
formed  by  these  and  other  networks  is  done  for  the  first  time  in  this  paper.  One  excep¬ 
tion  is  the  work  of  Lee  [10]  which  characterizes  the  permutations  realized  by  the 
inverse  omega  network  in  terms  of  residue  clashes. 

The rest of the paper is organized as follows. Basic definitions and notations used
throughout the paper are presented in Section II. Also included in this section is a
motivational example for the concept of frame. In Section III, this concept, illustrations
of many different frames, notation and terminology are introduced. Permutations
realized by the $k$-stage reverse baseline, $1 \le k \le n$, and the networks which are topologically
equivalent to it are characterized in Section IV. In Section V, the permutations
realized by a cascade of reverse baseline and the $k$-stage shuffle-exchange networks are
identified. These cases show how frames can be used to characterize the permutations
of some relatively complex networks with more than $n$ stages. Section VI provides
new proofs for the rearrangeability of the three-stage Clos and Benes networks. Permutations
realized by the last $n+k-1$ stages of the Benes network are identified in Section
VII. This characterization illustrates how frames can be used to understand why a network
is rearrangeable. Section VIII concludes the paper. The Appendix (Section IX)
contains the proofs of most of the theorems and lemmata in the paper.


II. BASIC DEFINITIONS AND A MOTIVATIONAL EXAMPLE

Throughout this paper, matrices are denoted by single capital letters and columns
of a matrix are represented by the lower case of the capital letter denoting that matrix.
Matrix $A$ having $N$ rows and $k$ columns is denoted by $A_{N\times k}$. Given a matrix, e.g. $A_{N\times k}$,
the $j$th column is denoted by $a_j$, $1 \le j \le k$. To be able to refer to a set of specific
columns of a matrix, the notation $A_{x:y}$ is used to denote the submatrix that contains
those columns of $A$ whose indices are $x, x+1, \ldots, y$, where $1 \le x \le y$; if $x$ happens to
be greater than $y$, then $A_{x:y}$ refers to a nil matrix, unless stated otherwise. If $x=y$, then
$A_{x:y}$ refers to a single column $a_x$. Unless specifically stated, the number of the rows of
a matrix $A_{x:y}$ is assumed to be equal to $N$. $A_{N\times k}(i)$ refers to the $i$th row of the matrix
$A_{N\times k}$, where $0 \le i \le N-1$. A column vector of $N$ entries of which half are 0's and the
other half are 1's is called a column permutation. Unless otherwise stated, any column
of any matrix in this paper is a column permutation. The binary representation of a
nonnegative integer $b$, $0 \le b \le N-1$, is $(b_1 b_2 \cdots b_n)$ such that
$b = b_1 2^{n-1} + b_2 2^{n-2} + \cdots + b_n 2^0$.

A permutation on a set $X$ is a bijection of $X$ onto itself. A permutation $f$ permutes
the ordered list $0, 1, \ldots, N-1$ into $f(0), f(1), \ldots, f(N-1)$. A cyclic notation [20,21]
can be used to represent a permutation as the product of cycles, where a cycle
$(c_0 c_1 c_2 \cdots c_{k-1} c_k)$ means $f(c_0) = c_1$, $f(c_1) = c_2$, $\ldots$, $f(c_{k-1}) = c_k$, and
$f(c_k) = c_0$. The composition of several permutations $f_1 f_2 \cdots f_k$ is evaluated from
left to right, i.e., it maps $i$ into $f_k(\cdots(f_2(f_1(i)))\cdots)$.
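
The two conventions in this paragraph, left-to-right composition and cyclic notation, can be illustrated with a short Python sketch. The representation of a permutation as a list `f` with `f[i]` holding the image of `i`, and the function names `compose` and `cycles`, are assumptions made here for illustration only.

```python
def compose(*perms):
    """Compose permutations from left to right: i -> f_k(...f_2(f_1(i))...)."""
    result = list(range(len(perms[0])))
    for f in perms:
        # Apply the next permutation to the images accumulated so far.
        result = [f[x] for x in result]
    return result


def cycles(f):
    """Return the cycles of permutation f as a list of tuples."""
    seen = [False] * len(f)
    found = []
    for start in range(len(f)):
        if seen[start]:
            continue
        cycle, x = [], start
        while not seen[x]:
            seen[x] = True
            cycle.append(x)
            x = f[x]  # follow c_j -> f(c_j) until the cycle closes
        found.append(tuple(cycle))
    return found


f1 = [1, 2, 0]  # the cycle (0 1 2)
f2 = [0, 2, 1]  # the cycles (0)(1 2)
```

Because composition is read from left to right, `compose(f1, f2)` and `compose(f2, f1)` generally differ, which the two small permutations above make visible.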

Definition II.1. (Permutation matrix, identity permutation matrix, reverse
permutation matrix): A permutation $h$ can be represented by an $N \times n$ binary matrix
called permutation matrix, $H$, such that its $i$th row, $H_{N\times n}(i)$, is the binary representation
of the integer $h(i)$. The identity permutation matrix, denoted by $I_{N\times n}$, is the matrix
whose $i$th row is the binary representation of $i$ (this is called “standard matrix” in
[12]). The reverse permutation matrix, denoted $R_{N\times n}$, is the matrix whose $j$th column
is the $(n+1-j)$th column of $I_{N\times n}$.

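For instance, the identity permutation matrix $I_{8\times 3}$, the reverse permutation matrix
$R_{8\times 3}$ and a permutation matrix are shown below:

$$
I_{8\times 3} =
\begin{bmatrix}
0 & 0 & 0\\
0 & 0 & 1\\
0 & 1 & 0\\
0 & 1 & 1\\
1 & 0 & 0\\
1 & 0 & 1\\
1 & 1 & 0\\
1 & 1 & 1
\end{bmatrix},
\qquad
R_{8\times 3} =
\begin{bmatrix}
0 & 0 & 0\\
1 & 0 & 0\\
0 & 1 & 0\\
1 & 1 & 0\\
0 & 0 & 1\\
1 & 0 & 1\\
0 & 1 & 1\\
1 & 1 & 1
\end{bmatrix},
\qquad
\begin{bmatrix}
1 & 1 & 1\\
1 & 0 & 0\\
0 & 0 & 0\\
0 & 1 & 0\\
1 & 1 & 0\\
0 & 0 & 1\\
1 & 0 & 1\\
0 & 1 & 1
\end{bmatrix}
$$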

Clearly, there is a one-to-one correspondence between permutations and permutation
matrices. For instance, $R_{8\times 3}$ represents the permutation $r$:

$$
r = \begin{pmatrix}
0 & 1 & 2 & 3 & 4 & 5 & 6 & 7\\
0 & 4 & 2 & 6 & 1 & 5 & 3 & 7
\end{pmatrix}.
$$

Using the cyclic notation, $r$ is represented by $r = (0)(1\ 4)(2)(3\ 6)(5)(7)$.
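
As a minimal sketch of Definition II.1, the Python functions below build the permutation matrix of a permutation, the identity and reverse permutation matrices, and recover the permutation that a matrix represents. Matrices are held as lists of rows of bits; this layout and all function names are assumptions introduced for illustration.

```python
def to_bits(value, n):
    """Binary representation (b_1 ... b_n) of value, most significant bit first."""
    return [(value >> (n - 1 - j)) & 1 for j in range(n)]


def permutation_matrix(h, n):
    """N x n matrix whose ith row is the binary representation of h[i]."""
    return [to_bits(h[i], n) for i in range(len(h))]


def identity_matrix(n):
    """Identity permutation matrix: row i is the binary representation of i."""
    return permutation_matrix(list(range(2 ** n)), n)


def reverse_matrix(n):
    """Reverse permutation matrix: column j is column n+1-j of the identity."""
    return [row[::-1] for row in identity_matrix(n)]


def to_permutation(matrix):
    """Read each row as a binary number to recover the permutation."""
    return [int("".join(str(bit) for bit in row), 2) for row in matrix]


r = to_permutation(reverse_matrix(3))
```

For $n = 3$ the recovered list `r` agrees with the two-row form of the permutation $r$ given above.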




II.1. Networks

In the terminology used in this paper, a $k$-stage interconnection network (IN) consists
of $k$ columns of switching boxes (SBs), each followed and preceded by links
which form interconnection patterns (IPs) as shown in Figure II.1. The IPs formed by
the input and output links are denoted by $IP_{in}$ and $IP_{out}$, respectively. Thus, an IN
contains $(k+1)$ interconnection patterns labeled $IP_{in}, IP_1, IP_2, \ldots, IP_{k-1}, IP_{out}$. A
column of IN contains $N/2$ ($2\times 2$) SBs, each of which can be set either straight or cross.
Figures II.2, II.3, II.4, II.5, and II.6 show several networks considered in this paper for
$N=16$, namely, reverse baseline, baseline, Benes, the 4-stage shuffle-exchange (SE),
and the 4-stage inverse SE. If some networks are placed in parallel to form a new IN,
then the IN is said to be a “pile of networks”. Unless otherwise stated, any IN is
assumed to have $N$ inputs/outputs and its stages are labeled from left to right starting
with 1. Network stages are defined below and illustrated in the figures.

Definition II.2. (Stages of reverse baseline, baseline, Benes, SE, and inverse
SE networks): With one exception, a stage in the reverse baseline and SE networks
consists of a connection pattern and the following column of SBs. The exception is the
rightmost stage (i.e., the output stage) which consists of the last column of SBs and
both the preceding and succeeding connection patterns. Stages are labeled from left to
right in ascending order starting with 1. In the baseline network the $k$th stage
corresponds to the $(n-k+1)$th stage of the reverse baseline network. (Notice that both
the reverse baseline and the baseline can have at most $n$ stages, by definition). In the
inverse SE network with $m$ stages, its $k$th stage corresponds to the $(m-k+1)$th stage of
the $m$-stage SE network. In this paper, the Benes network is considered as being composed
of the first $n-1$ stages of the $n$-stage baseline followed by the $n$-stage reverse
baseline. (It could also be considered as being composed of the $n$-stage baseline followed
by the last $n-1$ stages of the $n$-stage reverse baseline). Therefore, the stages of
the Benes network are labeled according to the labeling rules of the baseline and the
reverse baseline.


Figure II.3. The 4-stage baseline network with 16 inputs/outputs.

Figure II.5. The omega network (i.e., the 4-stage SE) with 16 inputs/outputs.

Figure II.6. The inverse omega network (i.e., the 4-stage inverse SE) with 16 inputs/outputs.




An IN having $N$ inputs/outputs and $k$ stages is denoted by both $IN_k$ and $IN_{1:k}$,
where $k \ge 1$. The subnetwork that consists of the stages $x$ through $y$ of $IN_{1:k}$ is denoted
by $IN_{x:y}$, where $1 \le x \le y \le k$. If $x>y$, then $IN_{x:y}$ refers to a nil network, unless specified
otherwise. $IN_j$, $1 \le j \le k$, refers to the $j$th stage of $IN_{1:k}$. The notation used for networks
is different from that used for matrices because matrices are always denoted by
single letters.

Without loss of generality, it is assumed that routing of a permutation through a
network is done as described in this paragraph. Assuming that the stages of the network
are labeled from left to right starting with 1, if the routing tag is $d_1 d_2 \cdots d_k$,
then $d_i$ is examined to set the SB at stage $i$ as follows: if $d_i$ equals zero then the input
is sent to the upper output of the SB; otherwise, it is sent to the lower output. The $i$th
entries of the routing tags of the two inputs entering a SB are also called the control
bits of that SB. So, to set a SB properly to either straight or cross (or equivalently not
to have any conflict in a SB), the control bits of a SB must constitute the set $\{0, 1\}$. In
some networks, the routing tag of an input equals its destination address, but this is not
always the case.
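
The rule for setting a single switching box from its two control bits can be sketched in Python as follows. The function name and the string labels for the three outcomes are illustrative assumptions; the sketch also assumes that "straight" means the upper input leaves on the upper output.

```python
def switch_setting(upper_bit, lower_bit):
    """Setting of a 2x2 SB given the control bits of its upper and lower inputs.

    A control bit of 0 sends that input to the upper output and a control
    bit of 1 sends it to the lower output.
    """
    if upper_bit == 0 and lower_bit == 1:
        return "straight"  # upper input -> upper output, lower -> lower
    if upper_bit == 1 and lower_bit == 0:
        return "cross"     # upper input -> lower output, lower -> upper
    # Both inputs request the same output: the control bits are not {0, 1}.
    return "conflict"
```

A routing is conflict-free at a stage exactly when this function returns "straight" or "cross" for every SB of that stage.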

In  this  paper  the  following  convention  is  adopted  to  denote  an  IN:  if  the  name  of 
an  IN  has  more  than  one  word,  then  it  is  denoted  by  the  upper  case  form  of  the  first 
letters  of  those  words;  otherwise,  it  is  denoted  by  the  upper  case  form  of  its  first  and 
last  letters.  Also,  if  XX  denotes  an  IN,  then  the  inverse  XX  network  may  be  denoted  by 
$XX^{-1}$. The following definition applies this convention to the baseline, reverse baseline,
shuffle-exchange, inverse shuffle-exchange, Benes and three-stage Clos networks
of  interest  in  this  paper. 

Definition II.3. (BE, RB, SE, $SE^{-1}$, BS, CS, composite IN): The symbols BE,
RB, SE, $SE^{-1}$, BS and CS in this paper refer to the networks baseline, reverse baseline,
shuffle-exchange, inverse shuffle-exchange, Benes and three-stage Clos network
$\nu(2,2,N/2)$ [7,13], respectively. (If the number of inputs/outputs of the three-stage Clos
network $\nu(2,2,N/2)$ is equal to $N$, then each of the outside stages of the three-stage Clos
network in this paper contains $N/2$ ($2\times 2$) SBs and the middle stage consists of 2 boxes
with $N/2$ inputs/outputs each). If an IN is a cascade of different INs, then it is called a
composite IN and is denoted by the concatenation of symbols that represent the INs in
the order they are cascaded.

As an example for a composite network, the notation RB_{1:n}SE_{1:m}, m ≥ 1, denotes
the network consisting of RB_{1:n} followed by SE_{1:m}.

Linial  and  Tarsi  [2]  introduced  the  concept  of  balanced  matrices  to  establish  a 
relation  between  SE  networks  and  their  realizable  permutations.  The  following 
definition  is  equivalent  to  the  one  given  in  [2]. 




Definition II.4. (Balanced matrix): Let N = 2^n and call a 0-1 matrix A_{N×k} balanced
if either one of the following conditions is satisfied:

1. For k ≤ n, it consists of any k columns of the binary representation of a permutation
on the set {0,1,...,N-1}.

2. For k > n, every n consecutive columns form the binary representation of a permutation
on the set {0,1,...,N-1}.
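The two conditions of Definition II.4 can be checked mechanically. The following Python sketch is one possible way to do so; the function names are illustrative and not part of the paper. For k ≤ n it relies on the counting fact that an N×k matrix consists of k columns of a permutation exactly when every k-bit pattern occurs 2^{n-k} times among its rows.

```python
from collections import Counter


def row_value(bits):
    """Read a sequence of 0/1 entries as a binary number (leftmost bit first)."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


def is_balanced(matrix, n):
    """Return True if the 0-1 matrix (a list of N = 2**n rows) is balanced."""
    N = 2 ** n
    if len(matrix) != N:
        return False
    k = len(matrix[0])
    if any(len(row) != k for row in matrix):
        return False
    if k <= n:
        # Condition 1: every k-bit pattern must occur exactly 2**(n-k) times.
        counts = Counter(row_value(row) for row in matrix)
        return len(counts) == 2 ** k and all(
            c == 2 ** (n - k) for c in counts.values()
        )
    # Condition 2: every window of n consecutive columns has N distinct rows.
    return all(
        len({row_value(row[s:s + n]) for row in matrix}) == N
        for s in range(k - n + 1)
    )
```

For example, the binary representation of the identity permutation on {0,1,2,3} is balanced, and so is any single column of it.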


As  an  example,  two  balanced  matrices  E  and  F  are  shown  below.  But  notice  that 
the  matrix  [E  F]  is  not  balanced. 


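E = [e_1 e_2 e_3] (8×3) and F = [f_1 f_2] (8×2). The row and column layout of the two
matrices did not survive reproduction; the entries are listed in the order in which they
appear in the source:

E: 1 1 1 1 0 1 0 0 1 0 0 0 0 1 1 0 1 1 1 0 0
F: 1 0 1 1 0 0 1 0 1 1 0 1 0 0 0 1 1 0 0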

Definition II.5. (Pass, realize): A balanced matrix A_{N×k} (respectively, an IN) is
said to pass a k-stage IN (respectively, a matrix A_{N×k}) if no conflict occurs in the SBs
of the IN when A_{N×k}(i) is used as the routing tag for the ith input of the IN. A network
IN realizes a permutation represented by B_{N×n} if there is a network switch setting such
that input i is sent to output B(i) for all i = 0, 1, ..., N-1.

According to the last definition, in this paper, the phrases “an IN passes a balanced
matrix” and “a balanced matrix passes an IN” are used interchangeably. It is also
assumed that only "one pass" is allowed through a network to realize a permutation.
Therefore, the phrase "one pass" is omitted in the sequel. To emphasize the distinction
between the meaning of the terms “pass” and “realize” as used in this paper, it is
important to notice that matrix A_{N×k} in Definition II.5 does not necessarily correspond
to the permutation realized by the network IN. Indeed, the ith row of A_{N×k} is the routing
tag for input i and it is only when it equals the destination of input i that A_{N×k} is the
permutation realized by IN; the cases in which this occurs will become clear in the
remainder of the paper.

II.1. A Motivational Example

Consider permutations π_1 = (0 6)(1 2)(3 5 4)(7), π_2 = (0 2)(1 4 3 7)(5 6), and the
reverse baseline network with 8 inputs/outputs, denoted by RB_{8×3} and shown in Figure
II.7a. A frame is illustrated in Figure II.7b. The binary representations of these
permutations are given below:




        Π_1        Π_2

       1 1 0      0 1 0
       0 1 0      1 0 0
       0 0 1      0 0 0
       1 0 1      1 1 1
       0 1 1      0 1 1
       1 0 0      1 1 0
       0 0 0      1 0 1
       1 1 1      0 0 1

Figure II.7. (a) The reverse baseline network with 8 inputs/outputs.
(b) A frame.

When the ith row, 0 ≤ i ≤ 7, of both Π_1 and Π_2 is used as the routing tag for the ith
input of RB_{8×3}, no conflict occurs in the switches and connections are established
between the input i and the outputs Π_1(i) and Π_2(i), respectively. Therefore, RB_{8×3}
realizes π_1 and π_2. Now, let us place the ith row of Π_1 and Π_2 into the ith row of the
frame in Figure II.7b with 8 rows as shown in Figure II.8a and Figure II.8b, respectively.




Figure II.8. (a) Frame with the binary representation of the permutation π_1.
(b) Frame with the binary representation of the permutation π_2.
(In both panels the rows are labeled 0 to 7 and the columns 1 to 3; row i holds the ith
row of Π_1 and Π_2, respectively.)


The first k columns, 1 ≤ k ≤ 3, of either of these two frames consist of 2^{3-k} rectangles of
size 2^k × k. Note that the matrix enclosed by any rectangle of the frames is balanced (in
fact, it represents a permutation on {0, 1, ..., 2^k - 1}). It is shown in Section IV that, when
the rows of any permutation realized by the reverse baseline network are placed into
this type of frame, the matrix enclosed by each rectangle is balanced, and vice versa.
Different frames are introduced in this paper and it is shown how they are useful to
identify the permutations realized by some frequently used networks.
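The observation made in this example can be reproduced with a short Python sketch. It assumes the frame is the one described above, in which the first k columns are cut into consecutive blocks of 2^k rows, and it checks that every such block holds 2^k distinct k-bit rows. The helper names are illustrative only.

```python
def perm_from_cycles(cycles, size):
    """Build the list p with p[i] = image of i from cycle notation."""
    perm = list(range(size))
    for cycle in cycles:
        for pos, x in enumerate(cycle):
            perm[x] = cycle[(pos + 1) % len(cycle)]
    return perm


def binary_rows(perm, n):
    """Binary representation: row i holds the n bits of perm[i], MSB first."""
    return [[(v >> (n - 1 - c)) & 1 for c in range(n)] for v in perm]


def fits_standard_frame(matrix, n):
    """Check every rectangle: 2**k consecutive rows, first k columns, k <= n."""
    num_rows = len(matrix)
    for k in range(1, min(len(matrix[0]), n) + 1):
        size = 2 ** k
        for top in range(0, num_rows, size):
            block = {tuple(matrix[r][:k]) for r in range(top, top + size)}
            if len(block) != size:  # a repeated row means the block is not balanced
                return False
    return True


pi_1 = perm_from_cycles([(0, 6), (1, 2), (3, 5, 4)], 8)
pi_2 = perm_from_cycles([(0, 2), (1, 4, 3, 7), (5, 6)], 8)
both_fit = all(fits_standard_frame(binary_rows(p, 3), 3) for p in (pi_1, pi_2))
```

Under these assumptions `both_fit` is true for the two permutations of the example, while the binary representation of the identity permutation fails the test already in the first column.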


III. FRAMES AND FUNDAMENTAL CONCEPTS

This section introduces the concept of frames to characterize the permutations
realized by a network. Different frames are derived from this concept and their graphical
representations are presented. In addition, some related fundamental concepts used
in the proofs of this paper are introduced. More extended discussion of these concepts
appears in [28].

In order to facilitate the understanding of the concept of frame, the following
definition is first introduced (a k-tuple V with the elements v_1, v_2, ..., v_k, denoted by
V = <v_1, v_2, ..., v_k>, refers to an ordered collection of k elements).

Definition III.1. (Partition P_i, block, standard partition P_i^*, P^*): Let
X = {0, 1, ..., N-1}, and i = 1, 2, ..., n. A partition P_i of X is a tuple of 2^{n-i}
disjoint ordered subsets of X, called blocks, each of which is a tuple with 2^i distinct
elements. The partition P_i^* = < <h, h+1, ..., h+2^i-1> such that h mod 2^i = 0 and
h = 0, 1, ..., N-1 > is a standard partition of X. The n-tuple <P_i^*, i = 1, 2, ..., n> is
denoted by P^*.

Example III.1. Let N = 8. The following are the standard partitions:
P_1^* = < <0,1>, <2,3>, <4,5>, <6,7> >, P_2^* = < <0,1,2,3>, <4,5,6,7> > and
P_3^* = <0,1,2,3,4,5,6,7>. Also, P^* = <P_1^*, P_2^*, P_3^*>.
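As a small illustration, the standard partitions can be generated as follows. This is a minimal Python sketch; the function names are hypothetical and blocks are represented as tuples.

```python
def standard_partition(i, n):
    """Blocks <h, h+1, ..., h+2**i-1> for every h in 0..2**n-1 with h mod 2**i == 0."""
    size = 2 ** i
    return [tuple(range(h, h + size)) for h in range(0, 2 ** n, size)]


def standard_partitions(n):
    """The n-tuple of standard partitions for i = 1, 2, ..., n."""
    return [standard_partition(i, n) for i in range(1, n + 1)]
```

Calling `standard_partitions(3)` yields the three partitions listed in Example III.1 for N = 8.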




The notion of frame is defined next and an example (Example III.2) is given after
the definition. Note that the frame of Figures II.7 and II.8 is characterized by the labeling
of its columns, the labeling of its rows and how each column is partitioned. Therefore,
the definition of frame is done in terms of two mappings (the column and row
labeling) and a tuple of partitions (one for each column). The column labels determine
the number and size of the blocks in each partition and the row labeling determines the
elements in each block and their order. As precisely stated in the definition, the column
with label β(i) corresponds to a partition with 2^{n-β(i)} blocks with 2^{β(i)} elements each
and the mth element within the jth block corresponds to the label γ(r) of row
r = 2^{β(i)}(j-1) + (m-1). After Example III.2, a convenient graphical representation for
frames is introduced and its use is illustrated in Example III.3 for the frames described
in Example III.2.

Definition III.2. (Frame): Let 1 ≤ k ≤ n and 1 ≤ i ≤ k. A frame F_{N×k}, 1 ≤ k ≤ n,
is a 3-tuple <β,γ,P>, where

- β is a mapping of the set {1,2,...,k} into {1,2,...,n},

- γ is a permutation on the set {0,1,...,N-1} and

- P is a tuple of partitions <P_{β(1)}, P_{β(2)}, ..., P_{β(k)}> determined by β
and γ as follows:

P_{β(i)} = <P_{β(i),1}, P_{β(i),2}, ..., P_{β(i),2^{n-β(i)}}> where

P_{β(i),j} = <u_{1,j}, ..., u_{2^{β(i)},j}> such that

u_{m,j} = γ(2^{β(i)}(j-1) + m - 1) for 1 ≤ j ≤ 2^{n-β(i)} and 1 ≤ m ≤ 2^{β(i)}.

Definition III.3. (α-frame, standard α-frame): Consider the 3-tuple <β,γ,P>
that defines a frame F_{N×k}. If β is the identity permutation, then F_{N×k} is an α-frame
denoted by F^α_{N×k}. If β and γ are the identity permutations (which implies P = P^*), then
F_{N×k} is the standard α-frame denoted by F^{α*}_{N×k}.

By definition of the standard α-type frame, column f_i, 1 ≤ i ≤ n, has 2^{n-i} blocks,
each having 2^i rows. Unless otherwise stated, the number of rows of F_{1:k}, k ≥ 1, is
assumed to be N. Similar to the notation of matrices, to be able to refer to specific
columns of a frame, the notation F_{x:y} is used to denote the subframe that contains
those columns of F whose indices are x, x+1, ..., y. Unless specifically stated, the
number of rows of F_{x:y} is assumed to be N.

Example III.2. The following are examples of frames for N = 8 and k = 3.

(a) F_{8×3} = <β,γ,P> where β = (1 2)(3), γ is the identity permutation and
P = <P_2, P_1, P_3> such that P_1 = P_1^*, P_2 = P_2^* and P_3 = P_3^*.

(b) F^α_{8×3} = <β,γ,P> where β = identity permutation, γ = (0)(1 2)(3)(4)(5)(6)(7),
P = <P_1, P_2, P_3>, P_1 = < <0,2>, <1,3>, <4,5>, <6,7> >, P_2 = < <0,2,1,3>,
<4,5,6,7> > and P_3 = <0,2,1,3,4,5,6,7>.

(c) F^α_{8×3} = <β,γ,P> where β = identity permutation, γ = (0)(1 3 6 4)(2 5)(7),
P = <P_1, P_2, P_3>, P_1 = < <0,3>, <5,6>, <1,2>, <4,7> >, P_2 = < <0,3,5,6>,
<1,2,4,7> > and P_3 = <0,3,5,6,1,2,4,7>.

(d) F_{8×3} = <β,γ,P> where

        ( 1 2 3 )
    β = ( 2 2 3 ),

γ is the identity permutation, P = <P_2, P_2, P_3>, P_2 = P_2^* and P_3 = P_3^*.
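The way β and γ determine the partitions of a frame can be illustrated with a short Python sketch. It follows the rule that the mth element of the jth block of the column labeled β(i) is the label γ(r) of row r = 2^{β(i)}(j-1) + (m-1). The function name and the list-based encoding of β and γ are assumptions made for this illustration.

```python
def frame_partitions(beta, gamma, n):
    """Return the tuple of partitions <P_beta(1), ..., P_beta(k)> of a frame.

    beta  -- list [beta(1), ..., beta(k)] of column labels
    gamma -- list where gamma[r] is the label of row r (a permutation of 0..2**n-1)
    """
    partitions = []
    for b in beta:
        size = 2 ** b  # number of elements per block in this column
        blocks = [
            tuple(gamma[size * (j - 1) + (m - 1)] for m in range(1, size + 1))
            for j in range(1, 2 ** (n - b) + 1)
        ]
        partitions.append(blocks)
    return partitions


# Part (b) of Example III.2: beta is the identity, gamma = (1 2) in cycle notation.
gamma_b = [0, 2, 1, 3, 4, 5, 6, 7]
P_b = frame_partitions([1, 2, 3], gamma_b, 3)
```

With these inputs `P_b[0]` is `[(0, 2), (1, 3), (4, 5), (6, 7)]`, which agrees with P_1 in part (b) of the example.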


Definition III.4. (Graphical representation of a frame, rectangle of a frame):
The graphical representation of a frame F_{N×k} = <β,γ,P> consists of k columns labeled
f_i, i = 1, ..., k, from left to right and N rows labeled γ(j), j = 0, 1, ..., N-1, starting at
the top. The column f_i corresponds to the partition P_{β(i)}, that is, f_i consists of 2^{n-β(i)}
blocks of 2^{β(i)} entries each. In the graphical representation of a frame, any polygon
with four sides and four right angles is a rectangle of the frame.


Example III.3. Figures III.1a, III.1b, III.1c and III.1d show the graphical
representation of the frames described in parts (a), (b), (c) and (d) of Example III.2,
respectively. Figure III.1e shows the graphical representation of the standard α-frame
F^{α*}_{8×3}. The labels of the partitions below each column are implied by the sizes of the
rectangles in the column and can be omitted.


Figure III.1. (a), (b), (c) and (d) are the graphical representations of the frames described in
parts (a), (b), (c) and (d) of Example III.2, respectively. (e) Graphical representation
of the standard α-frame F^{α*}_{8×3}. (In every panel the columns are f_1, f_2, f_3; the partitions
shown below the columns are P_2 P_1 P_3 in (a), P_1 P_2 P_3 in (b), (c) and (e), and
P_2 P_2 P_3 in (d).)


Definition III.5. (Fit): Let k ≥ 1, 0 ≤ i ≤ N-1 and 1 ≤ j ≤ k. Consider a balanced
matrix A_{N×k} and a frame F_{N×k}. The matrix A fits F_{N×k} if and only if, after placing a_{ij} in
the ith row and jth column of F_{N×k}, every rectangle of F_{N×k} contains a balanced
matrix.




Example III.4. The matrix E, shown just after Definition II.4, fits all the frames
shown in Figure III.1 except F^{α*}_{8×3} shown in Figure III.1e because, for example, the
submatrix in the top leftmost rectangle (the 2-tuple P_{1,1}) is not balanced.

Note that the value of k in Definition III.5 does not have to equal n. It will become
clear that frames of any number of columns can be used to characterize permutations
(which are represented by balanced matrices of n columns).

In addition to α-frames, two other types of frames are of use in this paper. One is
called universal frame and, as suggested by its name, any balanced matrix fits it. The
other type of frame is a concatenation of frames and is useful in characterizing the
permutations realized by, for example, composite networks.

Definition III.6. (Universal frame F^u_{1:k}): The universal frame F^u_{1:k}, k ≥ 1, is
such that, for i = 1, 2, ..., k, β(i) = n, γ is the identity permutation and P_i = P_n^*. The
universal frame F^u_{1:k} is illustrated in Figure III.2.


Figure III.2. The universal frame F^u_{1:k} (columns f_1, f_2, ..., f_k).

Definition III.7. (F^α_{1:n}F^u_{1:m}): The notation F^α_{1:n}F^u_{1:m}, m ≥ 1, represents the frame
obtained by concatenating F^α_{1:n} and F^u_{1:m} as shown in Figure III.3.


Figure III.3. The frame F^α_{1:n}F^u_{1:m}, which is obtained by concatenating F^α_{1:n} and F^u_{1:m}.


The following definition states precisely what it means to establish a correspondence
between a frame and a network.

Definition III.8. (Correspondence between frames and networks): A frame
(respectively, an IN) is said to correspond to an IN (respectively, a frame) if a balanced
matrix fits the frame if and only if it passes the network.

When  a  frame  corresponds  to  a  network  it  suffices  to  check  if  a  matrix  fits  the 
frame  in  order  to  determine  whether  the  network  passes  the  matrix.  This  does  not  mean 
that,  when  the  matrix  represents  a  permutation,  the  network  realizes  the  permutation. 
Instead,  it  means  that,  when  the  rows  of  the  matrix  are  used  as  routing  tags,  no 
conflicts  occur  in  the  network. 

The complexity of checking that a matrix fits a frame is discussed next. First, the
complexity of testing if a rectangle contains a permutation matrix is considered. Next,
the complexity of checking all rectangles of the same size is discussed and, finally, the
complexity of checking all rectangles (i.e., the entire frame) is derived. Note that it
suffices to consider only those rectangles whose number of columns equals the logarithm
of the number of rows.1 To check whether a given rectangle with x rows and
log x columns contains a balanced matrix, it suffices to verify that the rows of the
matrix are distinct. This can be done by building a binary search tree starting with the
root, which corresponds to the first row of the matrix; each row is then added as a leaf
to the tree as long as it is distinct from all previously inserted rows and so that it
satisfies the binary-search-tree property [29]. According to this property, if v(p) is the
value of the row that corresponds to node p, then v(y) < v(p) for any node y in the left
subtree of p and v(z) > v(p) for any node z in the right subtree of p. In the worst case,
this procedure takes O(x^2) steps and has average complexity of O(x log x) [29]. If
several rectangles of the same size exist in a frame, then the total number of rows
contained in all the rectangles with the same columns is N. The same procedure can be
used for each rectangle and the total worst case and average complexities will be
O(N^2) and O(N log N), respectively. Because there are at most k different types of
rectangles in a frame with k columns, the total worst case and average complexities are
O(kN^2) and O(kN log N), respectively. These bounds apply to any frame, but it is
possible to do better with particular frames. For example, for α-frames the worst-case
complexity becomes O(N^2 + 2(N/2)^2 + ··· + (N/2)2^2) = O(N^2).

1 All logarithms are in base 2 unless stated otherwise.
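The rectangle test described above can be sketched in Python as follows. This is a plain, unbalanced binary search tree keyed on the numeric value of each row, written only to illustrate the procedure; the class and function names are hypothetical.

```python
class Node:
    """Node of an unbalanced binary search tree keyed on a row's value."""

    def __init__(self, value):
        self.value = value
        self.left = None
        self.right = None


def row_value(bits):
    """Read a sequence of 0/1 entries as a binary number (leftmost bit first)."""
    value = 0
    for b in bits:
        value = 2 * value + b
    return value


def rows_are_distinct(rows):
    """Insert rows one by one; report False as soon as a repeated row is met."""
    root = None
    for bits in rows:
        v = row_value(bits)
        if root is None:
            root = Node(v)  # the first row becomes the root
            continue
        node = root
        while True:
            if v == node.value:
                return False  # same row seen before: rectangle is not balanced
            branch = "left" if v < node.value else "right"
            child = getattr(node, branch)
            if child is None:
                setattr(node, branch, Node(v))  # add the row as a new leaf
                break
            node = child
    return True
```

Rows that arrive in sorted order degenerate the tree into a chain, which is the quadratic worst case mentioned in the text.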


IV.  BASELINE-TYPE  NETWORKS 

Equivalence relations among INs have been extensively studied in the literature
using different tools such as graph theory, group theory, and Boolean algebra
[6,11,27,26]. Networks can be modeled by directed graphs where vertices and edges
represent switches and links, respectively. Two INs are functionally equivalent if they
realize the same set of permutations while two INs are topologically equivalent if their
topologies (i.e., directed graphs) are isomorphic. Wu and Feng [11] have shown the
topological equivalence of a class of MINs, which include data manipulator [14],
omega [1], flip [15], SW-banyan (s=f=2) [16], indirect binary n-cube [17], baseline
and reverse baseline [11]. From [18], "the notion of functional equivalence is more
practical than that of topological equivalence because it provides an equivalence basis
among networks at their inputs, and thus it does not call for any modification in their
internal switching structure". Given a network in a class of isomorphic INs, it is possible
to rename its inputs and/or outputs so that this network can directly simulate any
network in the class [11]. In this section, all the matrices that pass those networks that
are topologically equivalent to the k-stage baseline, 1 ≤ k ≤ n, are identified by α-frames
that may differ only in how their rows are labeled. First, the permutations realized
by the k-stage reverse baseline are identified. Then, this result is extended to the
other networks. These results also show how the addition of a stage to the right of
these networks changes the type of their realizable permutations. An algorithm is provided
to find whether a network is topologically equivalent to the reverse baseline network,
its corresponding frame and how to relabel inputs and outputs to achieve functional
equivalence. Omitted proofs are provided in the Appendix.


IV.1. Correspondence between F^α_{1:k} and RB_{1:k}

Because RB_{1:n} is functionally and topologically equivalent to BE_{1:n} [11], any
permutation that is realized by RB_{1:n} is also realized by BE_{1:n}, and vice versa. However,
this is not true for RB_{1:k} and BE_{1:k}, 1 ≤ k ≤ n-1, because they are not functionally
equivalent (they are only topologically equivalent). But the set of balanced
matrices that pass RB_{1:k} is the same as the set of balanced matrices that pass BE_{1:k}, as
explained next. The network RB_{1:k} can be obtained by repositioning the SBs of the
second stage through the last stage of BE_{1:k} and reordering its outputs. It follows that
any pair of routing tags that enter a SB at the kth stage of BE_{1:k} also enter a SB at the
kth stage of RB_{1:k}, and vice versa. So, if the routing tags used in BE_{1:k} do not create
any conflict, then they also do not have any conflict in the SBs of RB_{1:k}, and vice
versa. Therefore, a balanced matrix D_{1:k} passes RB_{1:k} if and only if D_{1:k} passes BE_{1:k}.


The following theorem shows that there exists a very close relation between
RB_{1:k} and F^{α*}_{1:k}, 1 ≤ k ≤ n, so that the matrices that pass the network can be identified
by F^{α*}_{1:k}. It shows that the ith input of RB_{1:k} is sent without conflicts to the output
whose value equals (⌊i/2^k⌋ × 2^k) plus the value of D_{1:k}(i) when the ith row of a matrix
D_{1:k} that fits F^{α*}_{1:k} is used as the routing tag for the ith input of RB_{1:k}.

Theorem IV.1. A matrix D_{1:k} = [d_1 d_2 ··· d_k] fits F^{α*}_{1:k} if and only if D_{1:k}
passes RB_{1:k}, 1 ≤ k ≤ n. Moreover, RB_{1:k} sends its ith input to its jth output, where j is
equal to the sum of (⌊i/2^k⌋ × 2^k) and the value of D_{1:k}(i).


Basic idea of proof (complete proof appears in Appendix):

(→) D_{1:k} fits F^{α*}_{1:k} → D_{1:k} passes RB_{1:k}.

Induction on k is used. For k = 1, each rectangle of F^{α*}_{1:1} has a 0 and a 1. These
correspond to the control bits of a switch in RB_{1:1} and, thus, no conflict occurs. For
k > 1, assuming the theorem holds for k-1, it is also shown that each switch in the kth
stage “has” control bits 0 and 1 and, therefore, no conflicts occur. These control bits
must appear as the kth bits at the end of identical (k-1)-bit rows of two (k-1)-column
subframes of F^{α*}_{1:k} so that D_{1:k} fits F^{α*}_{1:k}. Each subframe corresponds to a
subnetwork of RB_{1:k} which is also a reverse baseline network with k-1 stages.

(←) D_{1:k} passes RB_{1:k} → D_{1:k} fits F^{α*}_{1:k}.

Induction on k is used. For k = 1, if d_1 passes RB_{1:1}, then each rectangle of F^{α*}_{1:1}
contains a 0 and a 1 and d_1 fits F^{α*}_{1:1}. For k > 1, assuming the theorem holds for k-1, it
is shown that for the outputs of two (k-1)-stage subnetworks to cause no conflict
in any switch of the kth stage it must be the case that a 0 and a 1 are added to the k-1
entries of identical rows of the frames that correspond to the two subnetworks. This
implies that D_{1:k} fits F^{α*}_{1:k}. The value of j follows from the topology of RB_{1:k} and how
switches are set by control bits. □




Corollary IV.1. A network with k stages and N inputs/outputs is topologically
equivalent to the k-stage reverse baseline, RB_{1:k}, if and only if it corresponds to an
α-type frame F^α_{1:k} where 1 ≤ k ≤ n.


IV.2. Permutations Realized by Baseline-Type Networks

In this section, α-type frames are used to characterize all the permutations realized
by any network that is topologically equivalent to the baseline network. An algorithm,
called FRAME_IN, is introduced to determine the α-type frame that corresponds
to a given network. It is also shown how to construct a network to realize all the
permutations that fit an α-type frame.

Let F^α_{1:k}(α^{-1}) denote a particular α-type frame where γ = α^{-1}, i.e., whose row
labels form the vector α^{-1}. Let Π denote a network with k stages which is the same as
RB_{1:k} except that the label of its ith input equals the ith entry of α^{-1}. By Corollary
IV.1, a balanced matrix D_{1:k} fits F^α_{1:k}(α^{-1}) if and only if D_{1:k} passes Π. If k = n, any of
these balanced matrices represents a permutation, so that Π is a network that realizes
all the permutations characterized by F^α_{1:k}(α^{-1}). If k < n, then the relation between a
D_{1:k} that fits F^α_{1:k}(α^{-1}) and a permutation that passes Π is first determined. By applying
this relation to every balanced matrix that fits F^α_{1:k}(α^{-1}), all the permutations realized
by Π are determined. Theorem IV.3 determines the relation between a balanced
matrix that fits F^{α*}_{1:k} and the permutation realized by RB_{1:k} when this balanced matrix
passes the network. Corollary IV.3 generalizes this result to the class of baseline-type
networks.

Theorem IV.3. A matrix D_{1:k}, 1 ≤ k ≤ n, fits F^{α*}_{1:k} if and only if RB_{1:k} realizes
the permutation represented by [I_{1:n-k} D_{1:k}].

Proof. (→) Let D_{1:k} fit F^{α*}_{1:k}. It is shown that RB_{1:k} realizes the permutation
represented by [I_{1:n-k} D_{1:k}].

Theorem IV.1 states that RB_{1:k} sends its ith input, 0 ≤ i ≤ N-1, to its jth output,
where j is equal to the sum of (⌊i/2^k⌋ × 2^k) and the value of D_{1:k}(i). Due to the fact
that (⌊i/2^k⌋ × 2^k) + D_{1:k}(i) equals the ith row of [I_{1:n-k} D_{1:k}], RB_{1:k} realizes the
permutation represented by [I_{1:n-k} D_{1:k}].

(←) Assume that RB_{1:k} realizes the permutation represented by [I_{1:n-k} D_{1:k}]. It
is shown that D_{1:k} fits F^{α*}_{1:k}.

Because RB_{1:k} realizes the permutation represented by [I_{1:n-k} D_{1:k}], it sends its
ith input to the output whose value equals the sum of (⌊i/2^k⌋ × 2^k) and the value of
D_{1:k}(i); hence D_{1:k} passes RB_{1:k}. It follows from Theorem IV.1 that D_{1:k} fits F^{α*}_{1:k}. □
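The output address used in Theorems IV.1 and IV.3 can be computed directly. The Python sketch below forms, for every input i, the value ⌊i/2^k⌋ × 2^k plus the value of the ith routing tag, which is the ith row of [I_{1:n-k} D_{1:k}]. It does not check that D_{1:k} fits the frame; that is assumed. The function name is illustrative.

```python
def realized_permutation(D):
    """Map each input i to (floor(i / 2**k) * 2**k) + value of row i of D."""
    k = len(D[0])
    outputs = []
    for i, tag in enumerate(D):
        tag_value = 0
        for bit in tag:
            tag_value = 2 * tag_value + bit
        # Keep the leading n-k bits of i and replace the last k bits by the tag.
        outputs.append(((i >> k) << k) + tag_value)
    return outputs


# With k = n the tag is the full destination address.
tags = [[1, 1, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1],
        [0, 1, 1], [1, 0, 0], [0, 0, 0], [1, 1, 1]]
destinations = realized_permutation(tags)
```

For k < n the leading n-k bits of the output equal those of the input, so the network only permutes addresses within each group of 2^k consecutive outputs.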




Corollary IV.3. Consider a k-stage, 1 ≤ k ≤ n, network Π which is topologically
equivalent to RB_{1:k}. The network Π is functionally and topologically equivalent to a
network IP_in RB_{1:k} IP_out where IP_in and IP_out are interconnection patterns that realize
permutations α_in and α_out, respectively. Also, let F^α_{1:k}(α_in^{-1}) denote an α-type k-column
frame whose ith row label equals α_in^{-1}(i) for i = 0, 1, ..., N-1. A matrix D_{1:k} fits
F^α_{1:k}(α_in^{-1}) if and only if Π realizes the permutation α_in μ α_out, where μ is the
permutation represented by the balanced matrix [I_{1:n-k} D'_{1:k}] and D'_{1:k}(i) = D_{1:k}(α_in^{-1}(i)).

Corollary IV.3 implies that the network IP_in RB_{1:k} corresponds to the frame
F^α_{1:k}(α_in^{-1}), where IP_in realizes the permutation α_in. Hence, for a given F^α_{1:k}, a
corresponding network can be constructed easily. The following example shows the
construction of a network that realizes a set of permutations which includes two given
permutations.

Example IV.1. Let N = 16, k = 2, 0 ≤ i ≤ N-1. Assume that α_in and α_out denote
the permutations realized by the interconnection patterns IP_in and IP_out. Given two
permutations a = (0 9 8 5 1 2 12 10 14 6 3 7 11 13)(4)(15) and
b = (0 7)(1)(2 3 9 13 11 8 4 5)(6 12)(10 15 14), it is shown how to construct a network
IP_in RB_{1:2} IP_out that realizes a set of permutations including a and b. Let A and B refer
to the binary representations of a and b, respectively. By Theorem IV.3, any permutation
that passes RB_{1:2} must be represented by a balanced matrix whose first (leftmost)
two columns form I_{1:2} (recall that k = 2 and n = 4 in this example). If there was only one
given permutation, then the balanced matrix representing the permutation could be
converted by IP_in to a balanced matrix whose first two columns form I_{1:2} because IP_in
can be chosen so as to permute the rows in any given way. However, if more than one
permutation is given, and the first two columns of their binary representations do not
form the same matrix, then IP_out is needed to convert the binary representations of
these permutations into balanced matrices whose first two columns form the same
matrix. So, the matrices A and B are first converted by IP_out to A' and B' such that
A'_{1:2} = B'_{1:2}. Specifically, α_out converts A and B to A' and B', respectively, such that
A'(i) = α_out^{-1}(A(i)) and B'(i) = α_out^{-1}(B(i)). Then, A' and B' are converted by α_in to A'' and
B'', respectively, such that the first two columns of each of these matrices form I_{1:2}.
Specifically, A''(i) = A'(α_in^{-1}(i)) and B''(i) = B'(α_in^{-1}(i)). For instance,
α_out = (0 13 12)(1 5 7 2 4)(3 8 6 9 14)(10 15 11) converts a and b to
a' = (0 6 14 8 1 7 15 10 9 3 5 4 2 13 12 11) and
b' = (0 5 7 12 8 2 14 11 3 6 13 15 9)(1 4)(10), respectively. Similarly,
α_in = (0 5 4 1 7 15 9 3 6 13 12 11)(2 14 8)(10) converts a' and b' into
a'' = (0)(1 2)(3)(4)(5 6)(7)(8)(9 10)(11)(12)(13 14)(15) and
b'' = (0 3)(1)(2)(4 7)(5)(6)(8 11)(9)(10)(12 15)(13)(14), respectively. The binary
representations of a, a', a'', b, b' and b'' are shown below. The network that realizes the
permutations a and b is shown in Figure IV.1.


  i      A      B      A'     B'     A''    B''

  0    1001   0111   0110   0101   0000   0011
  1    0010   0001   0111   0100   0010   0001
  2    1100   0011   1101   1110   0001   0010
  3    0111   1001   0101   0110   0011   0000
  4    0100   0101   0010   0001   0100   0111
  5    0001   0010   0100   0111   0110   0101
  6    0011   1100   1110   1101   0101   0110
  7    1011   0000   1111   1100   0111   0100
  8    0101   0100   0001   0010   1000   1011
  9    1000   1101   0011   0000   1010   1001
 10    1110   1111   1001   1010   1001   1010
 11    1101   1000   0000   0011   1011   1000
 12    1010   0110   1011   1000   1100   1111
 13    0000   1011   1100   1111   1110   1101
 14    0110   1010   1000   1011   1101   1110
 15    1111   1110   1010   1001   1111   1100

Because a 2×2 switch has two possible settings (cross and straight), the number of
balanced matrices that pass a k-stage baseline-type network with N inputs equals 2^{kN/2}.
By Corollary IV.1, for any given k-column α-type frame, there exists a corresponding
baseline-type network. Therefore, exactly 2^{kN/2} balanced matrices fit any F^α_{1:k}. For
k = 2, 2^N balanced matrices pass a baseline-type network. Let D^r_{1:2}, 1 ≤ r ≤ 2^N, denote
one of the 2^N balanced matrices that fit F^α_{1:2}(α_in^{-1}). Also, assume that D'^r_{1:2} is obtained
from D^r_{1:2} such that D'^r_{1:2}(i) = D^r_{1:2}(α_in^{-1}(i)). Let μ_r denote the permutation represented
by [I_{1:2} D'^r_{1:2}]. So, the network shown in Figure IV.1 realizes any of those permutations
that result from α_in μ_r α_out. The ith row of D'^r_{1:2} is used as the routing tag for the
ith input of RB_{1:2} in IP_in RB_{1:2} IP_out. As an example, let r = 1 and consider the balanced
matrix D^1_{1:2}, shown in Figure IV.2a, that fits F^α_{1:2}(α_in^{-1}). The matrix D'^1_{1:2} that is
obtained from D^1_{1:2}, and the matrix [I_{1:2} D'^1_{1:2}] are also shown in Figure IV.2. When
the ith row of D'^1_{1:2} is used as the routing tag for the ith input of RB_{1:2}, RB_{1:2} realizes
the permutation μ_1 = (0)(1 3 2)(4 5 6)(7)(8 10 11)(9)(12 15 14 13) which is represented
by [I_{1:2} D'^1_{1:2}]. On the other hand, the network IP_in RB_{1:2} IP_out realizes the permutation
(0 9 4 8 5 7 3 1 2 12 6)(10)(11 13)(14 15) which results from α_in μ_1 α_out. End of
example.



IPj,  RBx2  IP** 
Figure  IV.  1.  The  network  //,urft51.2//,s4-  of  Example  rV.l. 


  i    (a) D^1_{1:2}(i)    (b) row label    (c) D'^1_{1:2}(i)    (d) [I_{1:2} D'^1_{1:2}](i)
  0         10                  11                00                   0000
  1         11                   4                11                   0011
  2         01                   8                01                   0001
  3         00                   9                10                   0010
  4         11                   5                01                   0101
  5         01                   0                10                   0110
  6         00                   3                00                   0100
  7         10                   1                11                   0111
  8         01                  14                10                   1010
  9         10                  15                01                   1001
 10         11                  10                11                   1011
 11         00                  12                00                   1000
 12         00                  13                11                   1111
 13         11                   6                00                   1100
 14         10                   2                01                   1101
 15         01                   7                10                   1110

Figure IV.2. (a) A balanced matrix $D^{1}_{1:2}$ which fits $F^{\alpha}_{1:2}(\alpha_{in}^{-1})$.
(b) $F^{\alpha}_{1:2}(\alpha_{in}^{-1})$ with $D^{1}_{1:2}$; the column lists the row label
$\alpha_{in}^{-1}(i)$ of the $i$th frame row, which holds row $D^{1}_{1:2}(\alpha_{in}^{-1}(i))$.
(c) $D'^{1}_{1:2}$ whose $i$th row equals $D^{1}_{1:2}(\alpha_{in}^{-1}(i))$. (d) $[I_{1:2}\ D'^{1}_{1:2}]$.
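The relabeling used in this example can be checked mechanically. The following Python sketch is only an illustration: the two lists are transcribed from panels (a) and (b) of Figure IV.2, and the helper names (`relabel`, `permutation_from_tags`, `cycles`) are hypothetical, not part of the paper. It forms $D'^{1}_{1:2}$ from $D^{1}_{1:2}$ and the row labels, reads $[I_{1:2}\ D'^{1}_{1:2}]$ row by row, and prints the cycles of the resulting permutation.

```python
# Data transcribed from Figure IV.2 (N = 16, n = 4, k = 2).
D = ["10", "11", "01", "00", "11", "01", "00", "10",
     "01", "10", "11", "00", "00", "11", "10", "01"]              # panel (a)
ROW_LABELS = [11, 4, 8, 9, 5, 0, 3, 1, 14, 15, 10, 12, 13, 6, 2, 7]  # panel (b)


def relabel(d, labels):
    """Return D' with D'(i) = D(labels[i])."""
    return [d[labels[i]] for i in range(len(d))]


def permutation_from_tags(d_prime, n):
    """Read [I D'] row by row: keep the top n-k bits of i, then append D'(i)."""
    k = len(d_prime[0])
    perm = []
    for i, tag in enumerate(d_prime):
        top = format(i, f"0{n}b")[:n - k]
        perm.append(int(top + tag, 2))
    return perm


def cycles(perm):
    """Cycle decomposition, each cycle starting at its smallest element."""
    seen, out = set(), []
    for start in range(len(perm)):
        if start in seen:
            continue
        cyc, x = [], start
        while x not in seen:
            seen.add(x)
            cyc.append(x)
            x = perm[x]
        out.append(tuple(cyc))
    return out


d_prime = relabel(D, ROW_LABELS)
mu_1 = permutation_from_tags(d_prime, 4)
print(cycles(mu_1))
# [(0,), (1, 3, 2), (4, 5, 6), (7,), (8, 10, 11), (9,), (12, 15, 14, 13)]
```

The printed cycles agree with the permutation $\mu_1$ stated in the example.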


In the rest of this section, some preliminary results used in Algorithm
FRAME_IN are first presented, then the algorithm is introduced.




Lemma IV.1. Let $r$ denote the reverse permutation represented by the reverse
permutation matrix $R_{N \times n}$ described in Definition II.1. The reverse baseline
network $RB_{1:n}$ realizes $r$ when all the switches are set straight.

Proof. The permutation realized by $RB_{1:n}$ when all the switches are set straight
is determined by the interconnection patterns $IP_{in}, IP_1, \ldots, IP_{n-1}, IP_{out}$.
Because $IP_{in} = IP_{out}$ = identity pattern, the permutation is given by
$\alpha_1 . \alpha_2 \cdots \alpha_{n-1}$ where $\alpha_i$ is the permutation realized by
$IP_i$. Permutation $\alpha_i$ is such that $\alpha_i(x)$ rotates left the rightmost $i+1$
bits of $x$ by one position because $IP_i$ is a pile of $2^{n-i-1}$ shuffle-exchange
patterns each with $2^{i+1}$ links. Applying this operation for all $i$ starting with the
initial matrix $I_{N \times n}$ yields the reverse permutation matrix
$R_{N \times n} = [i_n\ i_{n-1} \cdots i_1]$. □
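The bit-level argument of this proof can be replayed numerically. The sketch below is a minimal Python illustration with hypothetical function names; it assumes the rightmost bit of a line number is the least significant one. It composes the rotations $\alpha_1, \ldots, \alpha_{n-1}$ and compares the result with plain bit reversal.

```python
def rotate_left_low_bits(x, width):
    """Rotate the rightmost `width` bits of x left by one position."""
    mask = (1 << width) - 1
    low = x & mask
    rotated = ((low << 1) & mask) | (low >> (width - 1))
    return (x & ~mask) | rotated


def straight_setting_permutation(n):
    """Apply alpha_1, ..., alpha_{n-1} in order; alpha_i rotates the low i+1 bits."""
    perm = []
    for x in range(1 << n):
        y = x
        for i in range(1, n):
            y = rotate_left_low_bits(y, i + 1)
        perm.append(y)
    return perm


def bit_reverse(x, n):
    """Reverse the n-bit binary representation of x."""
    return int(format(x, f"0{n}b")[::-1], 2)


n = 4
perm = straight_setting_permutation(n)
print(perm == [bit_reverse(x, n) for x in range(1 << n)])  # True
```

For $n=4$ the computed permutation sends 1 to 8, 3 to 12, 7 to 14 and 11 to 13, which is the reverse permutation on 16 elements.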

Because the reverse baseline network can be converted to the baseline network by
repositioning the switches of the middle stages only, Lemma IV.1 is also valid for the
baseline network. If there exists a unique path between any input and any output of a
network, then the network satisfies the Banyan property [6,26]. Bermond et al. [26]
present a set of properties to determine whether a network is topologically equivalent
to the baseline network. Their main result is formally restated below.

Theorem IV.2. [26] Let $G$ be a directed graph representing a network with $n$
stages and $N$ inputs/outputs which satisfies the Banyan property. This network is
topologically equivalent to the baseline network if and only if both the first $j$ stages
and the last $j$ stages of $G$ contain $2^{n-j}$ connected components for each $j$,
$1 \le j \le n$.

This  result  is  used  next  as  the  basis  for  Algorithm  FRAME_IN.  The  description 
of  the  algorithm  is  followed  by  a  proof  of  its  correctness  and  analysis  of  its  complexity. 

Algorithm FRAME_IN

Input:    A network GN with 2x2 switches, $n$ stages and $2^n$
          inputs/outputs.

Output:   An $\alpha$-type frame that corresponds to GN if GN is topologically
          equivalent to the baseline network; the permutations $\alpha_{in}$ and
          $\alpha_{out}$ realized by the interconnection patterns $IP_{in}$ and
          $IP_{out}$, respectively, such that the network $IP_{in}GN_{1:n}IP_{out}$
          is functionally equivalent to $RB_{1:n}$.

Step 1.   Let $G$ denote a graph with $n$ "stages" that is obtained by
          representing the switches and links of the given network by
          vertices and edges that are directed from left to right,
          respectively.

Step 2.   Using a breadth-first search algorithm, check whether there
          exists a unique path between any input vertex and any output
          vertex of $G$. If so, go to next step. If not, go to Step 9.

Step 3.   Let $j$ and $p$ be integer variables initialized to 0.

Step 4.   Increment $j$ by 1. If $j > n$, then go to next step; otherwise, using
          a depth-first search algorithm, check whether the last $j$ stages of
          $G$ contain $2^{n-j}$ connected components. If so, go to Step 4. If
          not, go to Step 9.

Step 5.   Increment $p$ by 1. If $p > n$, then go to Step 7; otherwise, using a
          depth-first search algorithm, check whether the first $p$ stages of $G$
          contain $2^{n-p}$ connected components. If so, go to next step. If
          not, go to Step 9.

Step 6.   If $p=1$, let $V^{1}_{r}$ denote a vector of the input labels of a distinct
          connected component (a 2x2 switch) for each $r$, ($1 \le r \le 2^{n-1}$),
          and then go to Step 5; otherwise, do: let $V^{p}_{r}$, $1 \le r \le 2^{n-p}$,
          denote a vector that is formed by merging two vectors $V^{p-1}_{s}$
          and $V^{p-1}_{t}$ for $1 \le s, t \le 2^{n-p+1}$ and $s \ne t$ such that the set of
          entries of $V^{p}_{r}$ equals the set of input labels of a distinct
          connected component determined in Step 5. Go to Step 5.

Step 7.   Let $\gamma(i) = V^{n}_{1}(i)$ for $i = 0, 1, \ldots, N-1$ (note that $V^{n}_{1}$ is
          obtained in Step 6). Write "The $\alpha$-type frame $F^{\alpha}_{1:n}$ whose
          $i$th row label equals $\gamma(i)$ corresponds to the GN".

Step 8.   Let $\sigma$ denote the permutation realized by the given network
          $GN_{1:n}$ when all the switches are set straight. The permutation
          realized by $IP_{in}$ is $\alpha_{in} = \gamma^{-1}$. The permutation realized
          by $IP_{out}$ is $\alpha_{out} = \sigma^{-1} . \alpha_{in}^{-1} . r$, where $r$ is the
          reverse permutation represented by the reverse permutation
          matrix $R_{N \times n}$ (see Definition II.1). Stop.

Step 9.   Write "The given network is not topologically equivalent to
          baseline network and no corresponding $\alpha$-type frame exists".
          Stop.
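The component-counting tests of Steps 4 and 5 can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the authors' implementation: the omega network is modeled in the usual way, with a perfect shuffle between consecutive stages; switches are vertices `(stage, index)` with stages numbered from 0; the Banyan test of Step 2 and the label merging of Step 6 are omitted. All function names are hypothetical.

```python
def shuffle(line, n):
    """Perfect shuffle: rotate the n-bit line number left by one position."""
    return ((line << 1) & ((1 << n) - 1)) | (line >> (n - 1))


def omega_edges(n):
    """Inter-stage edges of an n-stage omega network (assumed model)."""
    edges = []
    for stage in range(n - 1):
        for sw in range(1 << (n - 1)):
            for out_line in (2 * sw, 2 * sw + 1):
                nxt = shuffle(out_line, n) // 2
                edges.append(((stage, sw), (stage + 1, nxt)))
    return edges


def count_components(n, edges, stages):
    """Connected components of the subgraph induced by `stages` (iterative DFS)."""
    stages = set(stages)
    adj = {(s, sw): [] for s in stages for sw in range(1 << (n - 1))}
    for u, v in edges:
        if u[0] in stages and v[0] in stages:
            adj[u].append(v)
            adj[v].append(u)
    seen, comps = set(), 0
    for v in adj:
        if v in seen:
            continue
        comps += 1
        seen.add(v)
        stack = [v]
        while stack:
            cur = stack.pop()
            for nb in adj[cur]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
    return comps


def satisfies_component_condition(n, edges):
    """Check the component counts of Theorem IV.2 for every j in 1..n."""
    for j in range(1, n + 1):
        first = count_components(n, edges, range(j))
        last = count_components(n, edges, range(n - j, n))
        if first != 1 << (n - j) or last != 1 << (n - j):
            return False
    return True


print(satisfies_component_condition(4, omega_edges(4)))  # True
```

A network whose inter-stage links simply connect switch $s$ to switch $s$ of the next stage fails this test already for $j=2$, because its first two stages form $2^{n-1}$ components instead of $2^{n-2}$.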

In Steps 2 through 6, Algorithm FRAME_IN checks whether the given network
satisfies the set of properties described in Theorem IV.2. Specifically, Step 2 checks
the Banyan property, while Steps 3 through 6 check whether both the first $j$ stages and
the last $j$ stages of the network graph contain $2^{n-j}$ connected components, for each
$j$. So, if Algorithm FRAME_IN fails in any of these steps, then it follows from Theorem
IV.2 that the given network is not topologically equivalent to the baseline network and,
by Corollary IV.1, has no corresponding $\alpha$-type frame.

It is now shown that the given network corresponds to the $\alpha$-type frame
determined in Step 7, that is, any balanced matrix that fits the $\alpha$-type frame
determined in Step 7 passes the given network, and vice versa. Theorem IV.1 proves that,
for $1 \le k \le n$, the frame $F^{\alpha}_{1:k}$ corresponds to $RB_{1:k}$, that is, a
balanced matrix $D_{1:k}$ fits $F^{\alpha}_{1:k}$ if and only if $D_{1:k}$ passes
$RB_{1:k}$. Note that $RB_{1:j}$ is a pile of $2^{n-j}$ copies of $RB_{2^j \times j}$. Recall
that the only difference between the standard $\alpha$-frame $F^{\alpha}_{1:n}$ and an
$\alpha$-type frame is the order of their row labels. Because Step 7 assigns $\gamma(i)$ to
the $i$th row label of $F^{\alpha}_{1:n}$, this frame corresponds to the given network.
Step 8 first assumes that the permutation realized by the given network equals $\sigma$
when all the switches are set straight. Then, Step 8 states that the interconnection
pattern $IP_{in}$ realizes the permutation $\alpha_{in} = \gamma^{-1}$. Relabeling the $i$th
input of the given network by $\gamma(i)$ is equivalent to adding the interconnection
pattern $IP_{in}$ to the left of the given network. Thus, any balanced matrix that fits the
$\alpha$-type frame obtained in Step 7 passes the network $IP_{in}GN_{1:n}$, and vice
versa. Algorithm FRAME_IN also adds an interconnection pattern $IP_{out}$ that realizes a
permutation called $\alpha_{out}$ to the right of the given network such that the network
$IP_{in}GN_{1:n}IP_{out}$ realizes the permutation $r$ when all the switches are set
straight. By Lemma IV.1, the reverse baseline (baseline) realizes the permutation $r$
when all the switches are set straight. Therefore, the network $IP_{in}GN_{1:n}IP_{out}$
is functionally and topologically equivalent to the reverse baseline and baseline
networks. This completes the proof of correctness of the algorithm.

The graph of Algorithm FRAME_IN can have at most $O(N \log N)$ vertices
because each vertex represents a switch. Algorithm FRAME_IN uses a breadth-first
search to check whether the given network holds the Banyan property. A depth-first
search is used to identify the connected components of $G$; the depth-first forest
contains as many trees as $G$ has connected components [29]. If $V$ and $E$ are the sets
of vertices and edges, respectively, the running time of both a breadth-first search and
a depth-first search is $\Theta(V+E)$. This implies that, for each value of $j$, Algorithm
FRAME_IN takes $\Theta(N \log N)$ time. Because there are $2 \log N$ iterations, the
running time of Algorithm FRAME_IN is $\Theta(N \log^2 N)$.

Algorithm FRAME_IN yields a frame that corresponds to the given network.
This means that any matrix that fits the frame also passes the network and vice versa.
However, this does not necessarily mean that the permutation represented by the
matrix is realized by the network. When a balanced matrix $D_{N \times n}$ fits an
$\alpha$-frame corresponding to a baseline-type network, the network realizes the
permutation $d . \alpha_{out}^{-1}$, where $d$ is the permutation represented by
$D_{N \times n}$ and $\alpha_{out}$ is the permutation realized by $IP_{out}$ determined in
Step 8 of Algorithm FRAME_IN. In other words, given a network that is topologically
equivalent to the reverse baseline, relabeling its inputs and outputs by
$\alpha_{in}^{-1}$ and $\alpha_{out}$, respectively, results in a new network that is
functionally equivalent to the reverse baseline.

As an example, for $N=16$, Algorithm FRAME_IN can be used to characterize the
permutations of the following baseline-type networks: generalized cube, omega,
indirect binary n-cube, banyan (S=F=2), inverse omega, modified data manipulator,
flip. The topological equivalence among these networks and baseline and reverse
baseline networks is well known and previously studied in [6,11,18,26]. From
Corollary IV.1, each of these networks corresponds to an $\alpha$-frame. Algorithm
FRAME_IN yields the row labeling $\gamma$ and $\alpha_{out}$ for each of these networks and
frames as follows:

$\gamma = \alpha_{out}$ = identity permutation for the reverse baseline and baseline
networks,

$\gamma$ = the reverse permutation = (0)(1 8)(2 4)(3 12)(5 10)(6)(7 14)(9)(11 13)(15) and
$\alpha_{out}$ = identity permutation for the omega and generalized cube,

$\gamma$ = identity permutation and
$\alpha_{out}$ = (0)(1)(2 8)(3 9)(4)(5)(6 12)(7 13)(10)(11)(14)(15) for the indirect binary
cube, banyan, inverse omega, and flip networks,

$\gamma$ = (0)(1)(2 8)(3 9)(4)(5)(6 12)(7 13)(10)(11)(14)(15) and $\alpha_{out}$ = identity
permutation for the modified data manipulator network.

V. NETWORKS $RB_{1:n}SE_{1:m}$ AND $SE^{-1}_{1:m}RB_{1:n}$

This section illustrates how frames can be used to characterize permutations
performed by relatively complex networks with more than $n$ stages. It is first shown
that the balanced matrices that pass the network $RB_{1:n}SE_{1:m}$, $m > 0$, are
identified by a concatenated frame (Theorem V.1), then it is shown that
$RB_{1:n}SE_{1:m}$ is functionally and topologically equivalent to
$SE^{-1}_{1:m}RB_{1:n}$ (Theorem V.2). Hence, any balanced matrix passing
$RB_{1:n}SE_{1:m}$ also passes $SE^{-1}_{1:m}RB_{1:n}$ and vice versa. Theorem V.1 also
shows how the addition of a SE stage to the right of $RB_{1:n}SE_{1:m}$ affects the type
of permutations realized by the network. Theorem V.2 proves that the addition of a SE
stage to the right of $RB_{1:n}SE_{1:m}$ is equivalent to the addition of an inverse SE
stage to the left of $SE^{-1}_{1:m}RB_{1:n}$. All the proofs are provided in the Appendix.

V.1. Balanced Matrices and Shuffle-Exchange Networks

Linial and Tarsi [2] have shown how balanced matrices can be used to determine
the number of SE stages (or the number of passes through a single SE stage) necessary
to realize a given permutation. Lemma V.1 below restates their result using the
notation and assumptions of this paper.

Lemma V.1. [2] Let $M_{N \times m}$ and $C_{N \times k}$ be balanced matrices such that
$M_{N \times m} = [I_{N \times n}\ C_{N \times k}]$, $k \ge 1$ and $n+k=m$. The network
$SE_{N \times k}$ realizes the permutation represented by $M_{(k+1):m}$, that is, by the
last $n$ columns of $M_{N \times m}$.

To illustrate Lemma V.1, consider the identity permutation matrix
$I_{8 \times 3} = [i_1\ i_2\ i_3]$ and the balanced matrices
$M_{8 \times 4} = [I_{8 \times 3}\ i_1]$, $M_{8 \times 5} = [I_{8 \times 3}\ i_1\ i_2]$ and
$M_{8 \times 6} = [I_{8 \times 3}\ I_{8 \times 3}]$. Because $M_{8 \times 4}$,
$M_{8 \times 5}$ and $M_{8 \times 6}$ are balanced, the permutations represented in
binary by $[i_2\ i_3\ i_1]$, $[i_3\ i_1\ i_2]$ and $[i_1\ i_2\ i_3]$ are realized by the
single-stage SE, 2-stage SE and 3-stage SE with $N=8$ inputs/outputs, respectively.
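The illustration of Lemma V.1 can be reproduced with a few lines of Python. The sketch assumes the definition of balance used by Linial and Tarsi for matrices with at least $n$ columns, namely that every $n$ consecutive columns form a permutation matrix; this definition is not restated in this section, so it is labeled as an assumption here. Matrices are stored column-wise and all function names are hypothetical.

```python
def identity_columns(n):
    """Columns i_1..i_n of I_{N x n}; i_1 holds the most significant bit of the row index."""
    size = 1 << n
    return [[(row >> (n - 1 - c)) & 1 for row in range(size)] for c in range(n)]


def is_balanced(columns, n):
    """Assumed definition: every window of n consecutive columns has N distinct rows."""
    size = 1 << n
    for start in range(len(columns) - n + 1):
        window = columns[start:start + n]
        rows = {tuple(col[r] for col in window) for r in range(size)}
        if len(rows) != size:
            return False
    return True


def realized_permutation(columns, n):
    """Permutation represented by the last n columns of M."""
    last = columns[-n:]
    return [int("".join(str(col[r]) for col in last), 2) for r in range(1 << n)]


i1, i2, i3 = identity_columns(3)
m4 = [i1, i2, i3, i1]            # M_{8x4} = [I i_1]
m6 = [i1, i2, i3, i1, i2, i3]    # M_{8x6} = [I I]

print(is_balanced(m4, 3), realized_permutation(m4, 3))
# True [0, 2, 4, 6, 1, 3, 5, 7]
print(is_balanced(m6, 3), realized_permutation(m6, 3))
# True [0, 1, 2, 3, 4, 5, 6, 7]
```

The first permutation is the perfect shuffle on 8 elements (one SE pass), and the second is the identity (three SE passes). By contrast, $[I_{8 \times 3}\ i_2]$ is not balanced under the assumed definition, because its last three columns $[i_2\ i_3\ i_2]$ repeat a column.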

V.2. Permutations Realized by $RB_{1:n}SE_{1:m}$

The following theorem shows how the concatenated frame Fi*F*:« can be used
to characterize the permutations realized by $RB_{1:n}SE_{1:m}$.

Theorem V.1. A balanced matrix $D_{1:(n+m)}$, $m \ge 0$, fits the frame F^F^ if and
only if $D_{1:(n+m)}$ passes the network $RB_{1:n}SE_{1:m}$. Moreover, $RB_{1:n}SE_{1:m}$
realizes the permutation represented by $D_{(m+1):(n+m)}$.


V.3. Permutations Realized by $SE^{-1}_{1:m}RB_{1:n}$

It is shown that the network $SE^{-1}_{1:m}RB_{1:n}$ constructed by appending the
network $SE^{-1}_{1:m}$ to the left of $RB_{1:n}$ is functionally and topologically
equivalent to the network $RB_{1:n}SE_{1:m}$ constructed by appending the $SE_{1:m}$
network to the right of $RB_{1:n}$. Also, because $RB_{1:n}$ is functionally and
topologically equivalent to $BE_{1:n}$, Theorem V.2 remains valid when $RB_{1:n}$ is
replaced by $BE_{1:n}$.

Theorem V.2. The network $RB_{1:n}SE_{1:m}$, $m \ge 1$, is topologically and
functionally equivalent to the network $SE^{-1}_{1:m}RB_{1:n}$.

VI. NEW PROOFS FOR REARRANGEABILITY OF BENES AND
THREE-STAGE CLOS NETWORKS

Rearrangeability of Benes and three-stage Clos networks is proven in [7,13] using
the Slepian-Duguid theorem which applies only to symmetric networks. In this section,
new simpler proofs are provided for rearrangeability of these networks using balanced
matrices and the properties of graph theory. These proofs directly lead to routing
algorithms [19] and provide an insight into the proofs of Section VII that identify the
permutations realized by subnetworks of the Benes network. In what follows, some
known results from [2] and definitions used in the proofs are presented first. Lemma
VI.1 from [2] is self-explanatory.

Lemma VI.1. [2] For $n \ge 2$, let $A$ and $B$ be two $N \times (n-1)$ balanced matrices.
Then there exists a column vector $x$ such that both $[A\ x]$ and $[x\ B]$ are balanced
matrices.

Note that, when the order of columns in a balanced matrix with at most $n$ columns
is changed, the matrix remains balanced. Therefore, the position of $x$ in the matrices
$A$ and $B$ in Lemma VI.1 is immaterial. Because the possible choices of vector $x$
increase as the number of columns of $A$ or $B$ is reduced, Lemma VI.1 remains valid
when $A$ and $B$ have fewer than $n-1$ columns.

Some properties of balanced matrices can be captured by graphs. Therefore, some
basic definitions of graph theory are given below. A graph $G=(V,E)$ consists of a set
of vertices $V$ and a set of edges $E$, each of which is a pair of vertices. The union of
two graphs $G_1=(V,E_1)$ and $G_2=(V,E_2)$ is the graph
$G = G_1 \cup G_2 = (V, E_1 \cup E_2)$. In other words, an edge is present in
$G = G_1 \cup G_2$ if and only if it is present in either $G_1$ or $G_2$. A subset $M$ of
edges in a graph $G$ is called independent or a matching if no two edges of $M$ have a
vertex in common. A matching $M$ is said to be a perfect matching if it covers all
vertices of $G$. More extended discussion of these basic concepts can be found in [3,4].

Definition VI.1. (Perfect matching graph of a matrix): Let $A$ be an $N \times k$
($1 \le k \le n-1$, $n \ge 2$) balanced matrix. A perfect matching graph of $A$, denoted by
$PG_A$, is a graph whose vertices are in one-to-one correspondence with the rows of $A$,
have degree one and vertices $v_i$ and $v_j$ are joined by an edge only if the $i$th row
and $j$th row of $A$ are identical.

If the number of columns in a balanced matrix $A_{N \times k}$ is less than $n-1$ (i.e., if
$k < n-1$), then its perfect matching graph is not unique because each distinct row in $A$
appears $2^{n-k}$ times. If $k = n-1$, then $PG_A$ is unique because each distinct row in
$A$ appears twice. As an example, consider the $8 \times 2$ balanced matrix presented just
after Definition II.4. Its perfect matching graph is unique and shown in Figure VI.1a.

Definition VI.2. (Labeling): 2-labeling or 2-coloring of a graph is the assignment
of integers 0 and 1 to its vertices such that the labels of the vertices incident with
an edge are different.

Fact VI.1. [2] The union of two perfect matching graphs with the same set of
vertices is a union of disjoint even cycles and, therefore, it can be 2-labeled.
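Fact VI.1 is constructive, and the construction also produces the column vector $x$ of Lemma VI.1. The Python sketch below is a minimal illustration with hypothetical names and a small hand-made example for $N=8$; it assumes each matrix is given as a list of row tuples in which every distinct row occurs an even number of times. It builds one perfect matching per matrix, walks the alternating cycles of their union, and labels the vertices 0 and 1.

```python
def matching_from_rows(rows):
    """Pair up identical rows; returns partner[i] for every row index i."""
    waiting, partner = {}, {}
    for idx, row in enumerate(rows):
        if row in waiting:
            other = waiting.pop(row)
            partner[idx], partner[other] = other, idx
        else:
            waiting[row] = idx
    if waiting:
        raise ValueError("a row occurs an odd number of times")
    return partner


def two_label(partner_a, partner_b):
    """2-label the union of two perfect matchings by following alternating cycles."""
    size = len(partner_a)
    label = [None] * size
    for start in range(size):
        if label[start] is not None:
            continue
        label[start] = 0
        v, use_a = start, True
        while True:
            nxt = partner_a[v] if use_a else partner_b[v]
            if label[nxt] is not None:   # the cycle has closed at `start`
                break
            label[nxt] = 1 - label[v]
            v, use_a = nxt, not use_a
    return label


# N = 8, n = 3: A holds the two high bits of i, B the two low bits of i.
A = [((i >> 2) & 1, (i >> 1) & 1) for i in range(8)]
B = [((i >> 1) & 1, i & 1) for i in range(8)]
x = two_label(matching_from_rows(A), matching_from_rows(B))
print(x)  # [0, 1, 0, 1, 1, 0, 1, 0]
```

Because identical rows of $A$ (and of $B$) receive different labels, all eight rows of $[A\ x]$ and of $[x\ B]$ are distinct, which is what Lemma VI.1 asserts for $N \times (n-1)$ matrices.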

Definition VI.3. (Perfect matching graph of a frame column): Let $f_m$ denote
a column of a frame $F_{N \times k}$. A perfect matching graph of $f_m$, denoted by
$PG_{f_m}$, is a graph whose vertices are in one-to-one correspondence with the row labels
of $F_{N \times k}$, have degree one and vertices $v_i$ and $v_j$ are joined by an edge only
if $i$ and $j$ belong to the same block of $f_m$.

Example VI.1. One possible perfect matching graph for frame column f\ in Figure
III.1b is shown in Figure VI.1a. The graph in Figure VI.1b is the unique perfect
matching graph of frame column ft in Figure III.1b.



Figure VI.1. (a) The perfect matching graph of the $8 \times 2$ balanced matrix presented
after Definition II.4; it is also one possible perfect matching graph for /5 shown in
Figure III.1b.

(b) The unique perfect matching graph for ft shown in Figure III.1b.


From Definition VI.3 and Example VI.1, it is clear that the perfect matching
graph of the frame column that consists of only the blocks of size two is unique and is
also a perfect matching graph for all the other columns in the same frame.

Let the black box, called $P(N!)$ and shown in Figure VI.2, denote a rearrangeable
(permutation) network on $N$ elements, i.e., it realizes all $N!$ distinct permutations
in a single pass.


Figure VI.2. A black box which realizes all $N!$ permutations.

This black box $P(N!)$ can be expanded recursively using Algorithm CONS_BENES
presented below until all of its black boxes are identical to (2x2) switching boxes
(SBs), each of which can be set both straight and cross. This expansion results in the
Benes network. Algorithm CONS_BENES substitutes the three-stage Clos network
with $R$ inputs/outputs, denoted by $CS_{R \times 3}$ and shown in Figure VI.3, for the
black box $P(R!)$.


Algorithm CONS_BENES

Input:    A black box called $P(N!)$.

Output:   Benes Network

Step 1.   Let $R$ be an integer variable and be initialized to $N$. Relabel
          the black box by $P(R!)$ and let BS denote a network
          consisting of $P(R!)$.

Step 2.   Replace each and every black box called $P(R!)$ of BS by
          $CS_{R \times 3}$ shown in Figure VI.3.

Step 3.   If all the SBs of BS are (2x2), then call BS Benes network and
          stop; otherwise first relabel each of its non-(2x2) SBs by $P(R!)$
          and halve the value of $R$, then go to Step 2.
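The recursive expansion can be summarized by the size of the network it produces. The following Python sketch is only an illustration with a hypothetical function name; it assumes, in line with the decomposition used in the proof of Theorem VI.1, that each $CS_{R \times 3}$ contributes one outer stage of $R/2$ (2x2) SBs on each side of two half-size black boxes.

```python
def cons_benes_shape(num_inputs):
    """Return (stages, switches) of the network obtained by the recursive expansion."""
    if num_inputs < 2 or num_inputs & (num_inputs - 1):
        raise ValueError("num_inputs must be a power of two, at least 2")
    if num_inputs == 2:
        return 1, 1                       # a single (2x2) switching box
    inner_stages, inner_switches = cons_benes_shape(num_inputs // 2)
    stages = inner_stages + 2             # one outer stage on each side
    switches = 2 * inner_switches + num_inputs  # two halves plus 2 * (R/2) outer SBs
    return stages, switches


print(cons_benes_shape(16))  # (7, 56)
```

For $N=16$ the result is 7 stages of 8 switches each, that is, $2n-1$ stages for $N=2^n$ inputs.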

Using the notions of balanced matrices and frames, it is first shown in the following
theorem that $CS_{R \times 3}$ is functionally equivalent to $P(R!)$. Then, it follows that
the Benes network constructed by Algorithm CONS_BENES is rearrangeable because,
due to the recursive structure of the algorithm, only the correctness of Step 2 needs to
be proven.


Figure VI.3. Three-stage Clos network with $R$ inputs/outputs which is denoted by $CS_{R \times 3}$, where

Theorem VI.1. Three-stage Clos network with $R$ inputs is rearrangeable.

Proof. As it is shown in Fig. VI.4, the network $CS_{R \times 3}$ is composed of three
components, namely, a) an inverse SE stage with $2^r$ inputs/outputs, b) a pile of two
permutation networks $P^{u}(2^{r-1}!)$ and $P^{l}(2^{r-1}!)$, and c) a SE stage with $2^r$
inputs/outputs. It is assumed in this proof that, unless otherwise stated, any balanced
matrix has $R = 2^r$ rows. Recall that $P(2^r!)$ refers to a rearrangeable network on
$2^r$ elements. Because $P(2^r!)$ passes any balanced matrix $B_{1:r}$ corresponding to a
permutation on $2^r$ elements, $CS_{R \times 3}$ must also pass $B_{1:r}$ in order to state
that $CS_{R \times 3}$ is functionally equivalent to $P(2^r!)$.

It is now shown that the inverse SE stage with $2^r$ inputs/outputs partitions
$B_{1:r}$ into $B^{u}_{2^{r-1} \times r}$ and $B^{l}_{2^{r-1} \times r}$ such that the
submatrices $B^{u}_{2^{r-1} \times (r-1)}$ and $B^{l}_{2^{r-1} \times (r-1)}$ are balanced,
where $B^{u}_{2^{r-1} \times (r-1)}$ and $B^{l}_{2^{r-1} \times (r-1)}$ are the first $(r-1)$
columns of $B^{u}_{2^{r-1} \times r}$ and $B^{l}_{2^{r-1} \times r}$, respectively. Both
$B^{u}_{2^{r-1} \times (r-1)}$ and $B^{l}_{2^{r-1} \times (r-1)}$ pass the permutation network
$P(2^{r-1}!)$ because it realizes any permutation on $2^{r-1}$ elements. Because the
control bits of each SB must constitute the set {0,1} to avoid conflict, any vector that
fits $f^{\alpha}_{1}$ can be used as the vector of control bits of the SBs of the inverse SE
stage. Let the perfect matching graph of $f^{\alpha}_{1}$ denote a graph with $R$ vertices
such that the vertices $v_{2j}$ and $v_{2j+1}$, $0 \le j \le 2^{r-1}-1$, are connected by an
edge, where $v_{2j}$ and $v_{2j+1}$ correspond to the $2j$th and $(2j+1)$th rows of
$B_{1:r}$, respectively. Let $x$ be a column vector obtained by a 2-labeling of the union
of the perfect matching graphs of $f^{\alpha}_{1}$ and $B_{1:(r-1)}$. By Fact VI.1, the matrix
$[x\ B_{1:(r-1)}]$ is balanced. This implies that $x$ "partitions" the balanced matrix
$B_{1:(r-1)}$ into two balanced submatrices $B^{u}_{2^{r-1} \times (r-1)}$ and
$B^{l}_{2^{r-1} \times (r-1)}$ in such a way that row $i$ of $B_{1:(r-1)}$ belongs to
$B^{u}_{2^{r-1} \times (r-1)}$ if the $i$th entry of $x$ equals zero, and belongs to
$B^{l}_{2^{r-1} \times (r-1)}$ otherwise, where $0 \le i \le 2^r-1$. Without loss of
generality, assume that the SBs of the inverse SE stage with $2^r$ inputs/outputs are
labeled in ascending order starting with 0 and that the control bit for the $i$th input is
the $i$th entry of $x$. So, when the $2j$th and $(2j+1)$th entries of $x$ are used as control
bits for the $j$th SB of the inverse SE stage, no conflict occurs and, hence, the matrix
$B_{1:(r-1)}$ is partitioned into $B^{u}_{2^{r-1} \times (r-1)}$ and
$B^{l}_{2^{r-1} \times (r-1)}$. Because both $P^{u}(2^{r-1}!)$ and $P^{l}(2^{r-1}!)$ can pass
any balanced matrix of order $2^{r-1} \times (r-1)$, the matrices
$B^{u}_{2^{r-1} \times (r-1)}$ and $B^{l}_{2^{r-1} \times (r-1)}$ pass $P^{u}(2^{r-1}!)$ and
$P^{l}(2^{r-1}!)$, respectively.

In order for $B_{1:r}$ to pass $CS_{R \times 3}$, $CS_{R \times 3}$ must send its $i$th input
to the output whose value equals $B_{1:r}(i)$. So far, this proof showed that
$CS_{R \times 3}$ sends its $i$th input with the row $B_{1:r}(i)$ to either the $h$th output of
$P^{u}(2^{r-1}!)$ or the $h$th output of $P^{l}(2^{r-1}!)$, where $h$ equals the value of
$B_{1:(r-1)}(i)$. Because $B_{1:r}$ is a balanced matrix, the last entries of the routing
tags of the rows that are sent to the $j$th outputs of $P^{u}(2^{r-1}!)$ and
$P^{l}(2^{r-1}!)$ constitute the set {0,1}. Due to the fact that the third component of
$CS_{R \times 3}$ is an SE stage, the rows that are sent to the $j$th outputs of
$P^{u}(2^{r-1}!)$ and $P^{l}(2^{r-1}!)$ enter the $j$th SB of the SE stage. Because the
connections of the SE stage implement the perfect shuffle permutation and the last
entries of the routing tags of the rows entering a SB constitute the set {0,1}, no
conflict occurs in the SBs. It follows that $CS_{R \times 3}$ sends its $i$th input to the
output whose value equals $B_{1:r}(i)$. Therefore, the theorem holds. □

Corollary VI.1. The Benes network obtained by Algorithm CONS_BENES is
rearrangeable.

Proof. Because Steps 1 and 3 of Algorithm CONS_BENES are relabelings and
the network is constructed recursively, it suffices to show that $CS_{R \times 3}$ is
functionally equivalent to $P(2^r!)$. Because this is proven in Theorem VI.1, the
corollary holds. □




VII. PERMUTATIONS REALIZED BY $BS_{(n-r):(2n-1)}$

Recall that Benes network can be considered as being
$BE_{N \times (n-1)}RB_{N \times n}$. Theorem IV.1 identified the permutations passing
$RB_{N \times n}$ in the sense that a balanced matrix $D_{N \times n}$ passes
$RB_{N \times n}$ if and only if $D_{N \times n}$ fits $F^{\alpha}_{N \times n}$. Likewise, the
following theorem and corollary determine the set of permutations that pass the network
$BS_{(n-r):(2n-1)}$ which consists of the subnetwork $BE_{(n-r):(n-1)}$ followed by
$RB_{N \times n}$, where $1 \le r \le n-1$. (Recall that $IN_{x:y}$ denotes the stages $x$
through $y$ of an IN and that $IN_{x:y}$ refers to a nil network if $x > y$). The
permutations that pass $BS_{(n-r):(2n-1)}$ are characterized by the frames defined next.
This characterization illustrates how frames can be used to gain insight into why the
Benes network is rearrangeable. All the proofs are provided in the Appendix. An example
is presented to illustrate the results of these proofs. For $N=16$, this example clearly
shows how the addition of the stage $BS_{n-r-1}$ to the left of $BS_{(n-r):(2n-1)}$
converts the frame that corresponds to $BS_{(n-r):(2n-1)}$ into a new frame that
corresponds to the resulting network.


Definition  VEL1.  (Fftr):  The  frame  Fft', 
ke  {1,2, . . .  ,n},  is  a  frame  <{3,y,F>  where 


W)  = 


('*' 


if  1  £i  £r+l 
ifr+l<i 


re  {0,1 . £-1} 


and 


y  is  the  identity  permutation  on  the  set  {0,1, . . .  ,N- 1 }  and 


Pi 


P  r+\  if  1  £  i  <I  r+1 
p*  \fr+l<i£k. 


Note that $F^{\beta,0}_{1:n}$ and $F^{\beta,n-1}_{1:n}$ are identical to $F^{\alpha}_{1:n}$
and the universal frame, respectively. As examples of $F^{\beta,r}_{1:n}$, the frames
$F^{\beta,0}_{1:4}$, $F^{\beta,1}_{1:4}$, $F^{\beta,2}_{1:4}$ and $F^{\beta,3}_{1:4}$ for $N=16$
are illustrated in Figure VII.1.


Theorem VII.1. Consider the frame $F^{\beta,r}_{1:n}$, $0 \le r \le n-1$. Let $S$ be a pile
of $2^{n-r-1}$ copies of a rearrangeable network $P(2^{r+1}!)$. Let $T$ be an IN that
consists of the network $S$ followed by $RB_{(r+2):n}$. A balanced matrix $A_{N \times n}$
fits $F^{\beta,r}_{1:n}$ if and only if $A_{N \times n}$ passes $T$.


Corollary VII.1. A balanced matrix fits the frame $F^{\beta,r}_{1:n}$ if and only if it
passes the network $BS_{(n-r):(2n-1)}$, where $0 \le r \le n-1$.




Example VII.1. Let $N=16$ and $n=4$. The frames $F^{\beta,0}_{1:4}$,
$F^{\beta,1}_{1:4}$, $F^{\beta,2}_{1:4}$ and $F^{\beta,3}_{1:4}$ are shown in Figure VII.1. By
Theorem IV.1, all balanced matrices that fit $F^{\alpha}_{1:4}$ pass $RB_{1:4}$. If the stage
$BE_3$ is added to the left of $RB_{1:4}$, the network $BS_{3:7}$ shown in Figure VII.2a
is obtained. While $RB_{1:4}$ passes all balanced matrices that fit $F^{\beta,0}_{1:4}$ (the
same as $F^{\alpha}_{1:4}$), a balanced matrix $D_{1:4}$ passes $BS_{3:7}$ if and only if
$D_{1:4}$ fits $F^{\beta,1}_{1:4}$. If the stage $BE_2$ is added to the left of $BS_{3:7}$,
then $BS_{2:7}$ shown in Figure VII.2b is obtained. A balanced matrix $D_{1:4}$ passes
$BS_{2:7}$ if and only if $D_{1:4}$ fits $F^{\beta,2}_{1:4}$. If the stage $BE_1$ is added to
the left of $BS_{2:7}$, then Benes network, $BS_{1:7}$, shown in Figure II.4 is obtained.
It is obvious that a balanced matrix $D_{1:4}$ passes $BS_{1:7}$ if and only if $D_{1:4}$
fits $F^{\beta,3}_{1:4}$, the universal frame. Notice that, when the stage $BE_j$,
$1 \le j \le n-1$, is added to the left of $BE_{(j+1):(n-1)}RB_{1:n}$, the subnetwork
$BE_{j:(n-1)}RB_{1:(n-j+1)}$ becomes a pile of $2^{j-1}$ copies of Benes network with
$2^{n-j+1}$ inputs/outputs and $2n-2j+1$ stages. Because Benes network with $2^{n-j+1}$
inputs/outputs and $2n-2j+1$ stages is a rearrangeable network, it corresponds to the
universal frame with $2^{n-j+1}$ rows and $n-j+1$ columns. Therefore, the first $n-j+1$
columns of $F^{\beta,n-j}_{1:n}$ is a pile of $2^{j-1}$ copies of the universal frame with
$2^{n-j+1}$ rows and $n-j+1$ columns. End of example.


Figure VII.1. (a) $F^{\beta,0}_{1:4}$. (b) $F^{\beta,1}_{1:4}$. (c) $F^{\beta,2}_{1:4}$ and (d) $F^{\beta,3}_{1:4}$.

Figure VII.2. (a) $BS_{3:7}$ ($BE_3$ followed by $RB_{1:4}$). (b) $BS_{2:7}$ ($BE_{2:3}$ followed by $RB_{1:4}$).


VIII. CONCLUSIONS

In  this  paper,  a  new  approach  has  been  developed  to  characterize  permutations 
realized  by  some  frequently  used  networks.  The  concept  of  frame  has  been  introduced 
and  different  frames  have  been  illustrated.  It  is  simple  to  check  whether  a  given  permu¬ 
tation  is  realized  by  a  given  network  once  the  corresponding  frame  and  the  output 
interconnection  pattern  are  known. 

The permutations of the following three classes of networks have been
characterized: the class of $k$-stage baseline-type networks that are topologically
equivalent to the $k$-stage baseline network, the class of those networks that are
constructed by appending shuffle-exchange stages to the left or right of a baseline-type
network, and the class of those networks that form a part of Benes network.

The proof that Benes network is rearrangeable was first presented in [7]. This
proof is based on the Slepian-Duguid theorem which applies only to symmetric
networks. In this paper, a new simple proof has been presented for rearrangeability of
Benes and three-stage Clos networks using the notion of balanced matrices and graph
theory. The technique used in this proof can also be applied to nonsymmetric
networks.

In practice, the results presented in this paper can be used to design networks that
realize classes of permutations that fit the same frame. In addition, engineers and/or
compilers may use frames to test if the corresponding networks realize a given
permutation. Debuggers and programming environments can also use frames to detect
when and why a permutation cannot be realized by the network. The definitions,
theorems and lemmata that are presented in this paper to characterize the permutations
realized in the aforementioned networks can also be used to address the issues of
routing and counting permutations. But, to limit the size of this paper, these issues are
addressed in [19,28].

It  is  clear  that  frames,  as  defined  in  this  paper,  cannot  characterize  the  permuta¬ 
tions  of  every  network.  Conceivably,  extensions  of  the  definitions  may  be  possible  to 
characterize  a  larger  class  of  networks.  In  particular,  the  concepts  should  be  extensible 
to networks not considered in this paper including those constructed with (kxk)
switches for $k > 2$. Future research will address these issues.


IX.  APPENDIX 


Proof  Theorem  IV.l:  (— ►)  It  is  shown  that  if  fits  Ff*  then  D\.±  passes 
$RB_{1:k}$. Proof is by induction on k. Also, it is proven that $RB_{1:k}$ sends its ith input to its
jth output, where j is equal to the sum of $\lfloor i/2^k \rfloor \times 2^k$ and the value of $D_{1:k}(i)$.

Basis Step: Let k=1. Label the SBs of $RB_1$ in ascending order starting with 0.
(Recall that $RB_1$ refers to the first stage of a reverse baseline network with N
inputs/outputs).  By  definition,  /f  contains  2"-1  blocks  of  size  2  each.  The  fact  that 
D  fits  Fff*  implies  that  d\  fits  /f*.  Therefore,  the  2nh  and  (2r+l)th  entries  of  dx 
constitute the set {0,1}, where $0 \le r \le 2^{n-1}-1$. Hence, when the 2rth and (2r+1)th
entries of $d_1$ are used as the control bits to set the rth SB of $RB_1$, no conflict occurs
and $RB_1$ sends its ith input to its jth output, where j is equal to the sum of $\lfloor i/2 \rfloor \times 2$
and the value of the ith entry of $d_1$, where $0 \le i \le N-1$. (Recall that, if the control bit
of the routing tag of an input equals zero, then the input is sent to the upper output of
the SB that it enters; otherwise it is sent to the lower output of the SB).
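
The basis step can be illustrated with a minimal Python sketch that routes the inputs of a single stage of 2x2 SBs from their control bits. The function name and the list representation are assumptions made for this example and are not notation from the paper. The output index is the sum of floor(i/2) x 2 and the control bit, as in the basis step, and a conflict is reported whenever the two inputs of an SB do not carry the set {0,1}.

```python
def route_first_stage(control_bits):
    """Route N inputs through one stage of 2x2 switching boxes (SBs).

    control_bits[i] is the control bit carried by input i. Inputs 2r and
    2r+1 enter SB r; bit 0 selects the upper output, bit 1 the lower one.
    Returns dest, where dest[i] is the output reached by input i, and
    raises ValueError if both inputs of some SB request the same output.
    (Hypothetical helper, written only to illustrate the basis step.)
    """
    n_inputs = len(control_bits)
    if n_inputs % 2:
        raise ValueError("a stage of 2x2 SBs needs an even number of inputs")
    dest = []
    for r in range(n_inputs // 2):
        pair = (control_bits[2 * r], control_bits[2 * r + 1])
        if set(pair) != {0, 1}:
            raise ValueError(f"conflict in SB {r}: control bits {pair}")
        for i in (2 * r, 2 * r + 1):
            # j = floor(i/2) * 2 + (control bit of input i)
            dest.append((i // 2) * 2 + control_bits[i])
    return dest
```

For example, the column (0, 1, 1, 0) sets SB 0 straight and SB 1 crossed, whereas (0, 0, 1, 0) is rejected because both inputs of SB 0 ask for the upper output.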


Induction  Step:  Assume  that,  for  2  £  k  £  n,  if  D  1:(*_d  fits  Fi?(*_d,  then  D 
passes $RB_{1:(k-1)}$ and $RB_{1:(k-1)}$ sends its ith input to its jth output, where j is equal to
the sum of $\lfloor i/2^{k-1} \rfloor \times 2^{k-1}$ and the value of $D_{1:(k-1)}(i)$. Now, show that, if $D_{1:k}$ fits

d 

Ff5t,  then  D  i-±  passes  RB  and  RB  i*  sends  its  ith  input  to  its  ;th  output,  where  j  is 


equal to the sum of $\lfloor i/2^k \rfloor \times 2^k$ and the value of $D_{1:k}(i)$.


The  frame  F'ft*,  (m=k-\,k),  can  be  considered  as  being  composed  of  2*~" 
copies  of  F&xm  in  parallel  if  the  row  labels  of  the  ctth,  0  £  a  £  2n~m-l,  F"**.  consists 
of  the  numbers  (ax2^)  to  [(a+l)x2m-l]  inclusive.  Let  F?-^  denote  the  ath  F?**. 
RBXjm  can  also  be  considered  as  being  the  pile  of  2n~m  distinct  RBy^ s.  Label  these 
RB  2-»*s  in  ascending  order  starting  with  0  at  the  top  and  denote  the  ath  one  by 
By  hypothesis,  D  x„  fits  Fft*.  Let  denote  the  submatrix  of  D  laB  that 




fits  F%m.  Thus,  the  induction  hypothesis  also  implies  that  (which  fits 

.1)  passes  and  that  RBfy-'xt-i  sends  its  pth  input  to  the  output  whose 

value  is  equal  to  the  value  of  (P)>  where  0  £p  £  2*"l-l. 

Let  Fji-itt-i  and  F^-i* t-i  denote  the  2/th  and  (2/+l)th  F£-i*±-iS,  respectively, 
where  0£/£2n'*-l.  Similarly,  let  and  Dp-^i  denote  the  2/th  and 

(2/+l)th  1$.  Likewise,  assume  that  and  denote  the  2/th 

and  (2/+l)th  /ZB^-i^.^s. 

Because  is  a  balanced  matrix  of  order  2*_1x(k-l),  it  has  2*_1  distinct 

rows.  Therefore,  the  matrix 

u  _  D%-'<k- 1) 

contains $2^{k-1}$ distinct rows, each being repeated twice. Assume that the rows of H are
partitioned into $2^{k-1}$ classes, each of which contains 2 identical rows, that is, each
class contains the two copies of a distinct row of H. After adding a column permutation
of length $2^k$ to the right of H, call the resultant matrix  This implies that

the  number  of  the  entries  of  the  rows  of  a  class  is  incremented  by  1.  In  order  for 
to  fit  Ffk**  the  kxh  entries  of  the  rows  of  each  class  of  H  must  constitute  the  set 
{0,1},  which  is  true  because  D  l:k  fits  F\dc  by  the  induction  hypothesis. 

By definition, the kth stage of the reverse baseline network, $RB_k$, consists of a pile of $2^{n-k}$
copies of the SE stage with $2^k$ inputs/outputs. Assume that the network consisting of
the  pile  of  two  networks  and  RB^-^k~i)  followed  by  the  SE  stage  with  2k 

inputs/outputs  is  called  RB §»**.  Because  sends  its  pth  input  to  the  output 

whose  value  is  equal  to  the  value  of  (p),  the  first  (k— 1)  entries  of  the  row  that 

is  sent  to  the  pth  output  of  the  network  *s  die  same  as  the  first  (£-1)  entries 

of  the  row  that  is  sent  to  the  pth  output  of  the  network  The  kth  entries  of 

those  two  rows  sent  to  the  pth  outputs  of  and  constitute  the  set 

{0,1}  because  £>§»**  fits  the  frame  F^**  by  induction  hypothesis.  Because  the  rows 
that  are  sent  ©  the  pth  outputs  of  RB^-^k-i)  and  FF?-‘x<*-i)  enter  the  pth  SB  of  the 
SE  stage  following  these  networks  such  that  the  kth  entries  of  these  rows  arc  the  con¬ 
trol  bits  for  the  SB,  no  conflict  occurs  in  the  pth  SB.  This  amounts  to  stating  that 
RB^  sends  its  hth  input  to  the  output  whose  value  is  equal  to  the  value  of  D^(h)t 
where $0 \le h \le 2^k-1$. Therefore, the balanced matrix $D_{1:k}$ passes $RB_{1:k}$ and $RB_{1:k}$
sends its ith input to its jth output, where $0 \le i \le N-1$ and j is equal to the sum of
$\lfloor i/2^k \rfloor \times 2^k$ and the value of $D_{1:k}(i)$.

(<-)  It  is  shown  that,  if  D  l:k  passes  RBlJC,thenD  uk  fits  Fft*.  Proof  is  by  induc¬ 
tion  oak. 




Basis Step: Let k=1. The fact that $d_1$ passes $RB_1$ implies that no conflict occurs
in the SBs of $RB_1$ when the ith entry of $d_1$ is used as the control bit for the ith input of
$RB_1$ in setting its rth SB. Because the control bits of the rth SB of $RB_1$ constitute the
set  {0, 1 }  and  fit  the  rth  block  of  d\  fits 

Induction Step: Assume that the theorem holds for k-1. It is shown that it also
holds for k, where $2 \le k \le n$.

By  induction  hypothesis,  if  ££-‘*±-1  passes  RB^-i^i  fits  Notice  that 

the  last  stage  of  RB^  is  the  SE  stage  with  2*  inputs/outputs.  Recall  that  the  network 
consisting  of  the  pile  of  two  networks  RB^-i^-d  and  followed  by  the  SE 

stage  with  2*  inputs/outputs  is  called  RB$**k-  As  it  is  also  explained  above,  the  rows 
that  are  sent  to  the  pth  outputs  of  and  enter  the  pth  SB  of  the 

SE  stage  that  follows  these  networks.  If  D§»x*  passes  then  the  kxh  entries  of  the 

rows  of  a  class  of  H  must  constitute  the  set  {0, 1 }  to  avoid  having  a  conflict  in  the  pth 
SB.  Therefore,  fits  F^.  It  follows  that  D  i*  fits  Ft*.  □ 

Proof of Corollary IV.1: (→) Let $\Phi$ be topologically equivalent to $RB_{1:k}$. When
interconnection networks are modeled by directed graphs in which vertices represent
the switches and edges the links, two networks are said to be topologically equivalent
if the graphs representing them are isomorphic. Two graphs G and H are said to be
isomorphic if there exist bijections from the vertices and edges of G to the vertices and
edges of H, respectively, such that the relationship of adjacency is preserved. So, if two
networks are topologically equivalent to each other, one of them can be made identical
to the other network by relabeling the inputs and/or outputs. This implies that $\Phi$ can be
made identical to $RB_{1:k}$ by relabeling the inputs and/or outputs of $\Phi$, and vice versa.
Because  (1)  Ffe  corresponds  to  RB\.jt  such  that  there  exists  a  one-to-one  correspon¬ 
dence  between  the  row  labels  of  Ff*  and  RB  ljk  (Theorem  IV.l),  (2)  the  only  differ¬ 
ence  between  Fff*  and  an  a-type  frame  Ff*  is  the  order  of  their  row  labels,  and  (3)  d> 
is  topologically  equivalent  to  RB  i-j,  there  exists  an  a-type  frame  Ff*  corresponding  to 
O  such  that  no  conflict  occurs  in  the  switches  of  d>  when  the  contents  of  the  rth  row, 
0  £  i  £  A/- 1,  of  F\.jt  are  used  as  the  routing  tag  for  the  ith  input  of  d>. 
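
The notion of isomorphism used in this proof can be made concrete with a brute-force Python sketch for very small directed graphs without parallel edges. The function name and the edge-list representation are assumptions made for illustration; the exhaustive search over relabelings is only practical for a handful of vertices and is not a method taken from the paper.

```python
from itertools import permutations


def are_isomorphic(n, edges_g, edges_h):
    """Brute-force isomorphism test for directed graphs on vertices 0..n-1.

    edges_g and edges_h are iterables of (tail, head) pairs. The graphs are
    isomorphic if some relabeling of the vertices of G maps its edge set
    exactly onto the edge set of H, so that adjacency is preserved.
    (Hypothetical helper for illustration; parallel edges are ignored.)
    """
    source = set(edges_g)
    target = set(edges_h)
    if len(source) != len(target):
        return False
    for relabel in permutations(range(n)):
        if {(relabel[u], relabel[v]) for u, v in source} == target:
            return True
    return False
```

In this sketch, two directed paths on three vertices are reported as isomorphic regardless of how their vertices are numbered, while a path and a vertex with two outgoing edges are not.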

(<-)  Let  y  denote  the  vector  of  input  labels  of  <h  such  that  the  ith  entry  of  y  equals 
the  ith  input  label  of  <t>.  Let  Ff^(7)  denote  the  frame  corresponding  to  d>  such  that  the 
ith  entry  of  y  equals  the  ith  row  label  of  the  frame.  By  definition  of  ‘  * correspondence ’  ’ 
(Definition  m.8),  no  conflict  occurs  in  the  switches  of  d>  when  the  contents  of  the  ith 
row  of  Ff  4(7)  are  used  as  the  routing  tag  for  the  ith  input  of  <t>.  Note  that  there  exists  a 
one-to-one  correspondence  between  the  input  labels  of  <t>  and  the  row  labels  of 
Ffjt(7).  Therefore,  when  both  the  ith  row  label  of  Ff*(7)  and  tire  ith  input  label  of  O 
are  replaced  by  tire  integer  i,  the  resulting  frame  Ffi  and  network  still  remain 
correspondent  to  each  other.  By  Theorem  IV.l,  Ff5t  corresponds  to  RB  1*.  It  follows 




that $\Phi$ can be converted to $RB_{1:k}$ by relabeling the input and/or output labels of $\Phi$.
Thus, $\Phi$ is topologically equivalent to $RB_{1:k}$. □

Definition IX.1 (forward-routing, reverse-routing): Given an $IN_{N \times k}$ and a setting
of its SBs that realizes $h: i \to h(i)$, forward-routing of a matrix A means that A(i)
is sent from input i to output h(i), where $0 \le i \le N-1$. Likewise, reverse-routing of A
means that A(i) is sent from the output i to the input $h^{-1}(i)$. The matrix
$A^F = A(h^{-1}(i))$, $i = 0, 1, \ldots, N-1$, is obtained by forward-routing of A. Similarly, the
matrix $A^R = A(h(i))$, $i = 0, 1, \ldots, N-1$, is obtained by reverse-routing of A.
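
A minimal Python sketch of this definition follows, assuming the matrix is stored as a list of rows and the permutation h as a list with h[i] the output reached by input i. Both function names are chosen for this example only.

```python
def forward_route(rows, h):
    """Forward-routing: row i travels from input i to output h[i].

    The returned list holds, at position i, the row A(h^{-1}(i)).
    """
    routed = [None] * len(rows)
    for i, row in enumerate(rows):
        routed[h[i]] = row
    return routed


def reverse_route(rows, h):
    """Reverse-routing: row i travels from output i back to input h^{-1}(i).

    The returned list holds, at position i, the row A(h(i)).
    """
    return [rows[h[i]] for i in range(len(rows))]
```

Under these assumptions the two operations undo each other: reverse-routing the forward-routed matrix through the same setting returns the original rows.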

Proof of Corollary IV.3: Because the network $\Pi$ is a k-stage baseline-type network,
it is topologically equivalent to $RB_{1:k}$. This implies that $RB_{1:k}$ can be made
identical to $\Pi$ by relabeling its inputs and/or outputs. Because relabeling the inputs
(respectively, outputs) of $RB_{1:k}$ is equivalent to adding an interconnection pattern to
the left (respectively, right) of $RB_{1:k}$, there exist two interconnection patterns $IP_{in}$ and
$IP_{out}$ such that $\Pi$ is topologically and functionally equivalent to $IP_{in} RB_{1:k} IP_{out}$.

(-*)  Assume  that  D  1:*  fits  ).  It  is  shown  that  the  network  II  realizes  the 

permutation  a^. 

Adding  the  interconnection  pattern  IP^  to  the  left  of  RB  ^  is  equivalent  to  rela¬ 
beling  the  ith  input  of  RB  1;*  by  a*1  (i).  Because  the  only  difference  between  two  a- 
typc  frames  with  k  columns  is  the  order  of  their  row  labels  and  IPout  is  just  an  intercon¬ 
nection  pattern,  it  follows  from  Theorem  IV.  1  that  D  ^  passes  IT  By  Definition  IX.  1, 
when  D  1:jk  is  forward-routed  through  the  interconnection  pattern  IP^,  D  is  mapped 
to  D\.jt  -Di-jk(aHil(0)t  *  =0, 1 , . . .  ,N-l.  By  Theorem  IV.3,  the  subnetwork  RB  1:jk  of 
II  realizes  the  permutation  p.  represented  by  [/  D*jc J.  Therefore,  the  network  n 

realizes  the  permutation  oc,*.p..aw-. 

(<-)  Assume  that  the  network  II  realizes  the  permutation  It  is  shown 

that  D  fits  Ffjci&iix ). 

The  fact  that  II  realizes  the  permutation  0*41.00*  implies  that  the  permutation  p. 
is  realized  by  RB  i;*  of  IT  Because  p.  is  the  permutation  represented  by  the  balanced 
matrix  [/i*-*  D\:k]  such  that  D^i)  =D1;*(a*  (/)),  it  follows  from  Theorem  IV.3 
that  Du  passes  RB  \:k.  By  Definition  EX.1,  when  D  i;*  is  reverse-routed  through  the 
interconnection  pattern  /P*t  D*:k  is  mapped  to  D  i*.  Thus,  D\.±  passes  IP^RB  i*- 
Note  that  tire  network  IP^RB  1:jk  is  identical  to  the  network  obtained  by  relabeling  the 
/th  input  of  RB  i:t  by  a*l(i).  In  addition,  because  Ff^a*1)  is  the  same  as  Fff*  except 
that  the  ith  row  label  of  Ff;*(a* )  equals  a*  (i)  instead  of  i,  D  j*  fits  Ff^a*1 ).  □ 




Proof  of  Theorem  V.l:  {-»)  It  is  ihown  that  if  D  i:(„+m)  fits  F^F*^,  then 
D 1^+^,)  passes  RB  i^SE  Vjm  and  the  permutation  represented  in  binary  by  D 
is  realized  by  RB \MSE  ljw. 

Recall that, by definition, $RB_{1:n}SE_{1:m}$ consists of $RB_{1:n}$ followed by $SE_{1:m}$.
Because,  by  hypothesis,  D  fits  FvjiF\-m,  D  1mi  fits  Ffo.  RBVjt  maps  the  matrix 
D  i :(,+*)  into  the  matrix  denoted  by  D  ?:<*+*,)  when  D  i;(n+A,)(0.  0  £  i  S  N-\,  is  used  as 
die  routing  tag  for  the  tth  input  of  RB  i*.  Theorem  IV.  1  has  shown  that  any  balanced 
matrix  D  i:*  fitting  the  frame  Ff5,  passes  the  network  RB  1:n.  So,  when  D  lai(i)  is  used 
as  the  routing  tag  for  the  z'th  input  of  RB  i*.  RB  \M  sends  its  ith  input  to  the  output 
whose  value  equals  D  !*(/).  So,  RB  lal  maps  any  Dj*  fitting  the  frame  Fffn  to  /  l3l. 
This  implies  that,  when  D  i^n+my(i)  is  used  as  the  routing  tag  for  the  ith  input  of 
RBhnSEij*,  the  submatrix  D*;jl  of  £*;<„+*,)  is  the  same  as  the  identity  permutation 
matrix  Ziy,.  Therefore,  £*;(„+*,)  is  equal  to  the  balanced  matrix  [/1:n  D(n+iWn+m)].  By 
Lemma  V.l,  SEl:m  realizes  the  permutation  represented  by  D*m+i):(n+w)  and  no 
conflict  occurs  in  the  SBs  of  SE  ly*  when  D*«+i):(a+«,)  (i)  is  used  as  the  routing  tag  for 
the  ith  input  of  SE  ly*.  Therefore,  D1;(n+m)  passes  RB^SE^.  Now,  it  remains  to 
show  that  RB  l:jtSE  iM  realizes  the  permutation  represented  by  D  (n+i):(n+m). 

Let the entries of $D_{1:(n+m)}(i)$ be denoted in binary by $(x^i_1 x^i_2 \cdots x^i_n \cdots x^i_{n+m})$. The
fact that $D^*_{1:n}$ of $D^*_{1:(n+m)}$ is identical to $I_{1:n}$ implies that $RB_{1:n}$ of $RB_{1:n}SE_{1:m}$ sends
the routing tag $D_{1:(n+m)}(i)$ to the output of $RB_{1:n}$ whose value equals the value of
$(x^i_1 x^i_2 \cdots x^i_n)$. Because the jth output of $RB_{1:n}$ is the same as the jth input of $SE_{1:m}$
when $RB_{1:n}SE_{1:m}$ is considered, $D_{1:(n+m)}(i)$ is sent to the yth input of $SE_{1:m}$ by $RB_{1:n}$,
where y equals $(x^i_1 x^i_2 \cdots x^i_n)$. Hence, the bit $x^i_{n+p}$, $1 \le p \le m$, of
$(x^i_1 x^i_2 \cdots x^i_n \cdots x^i_{n+m})$ is used as the control bit to set a SB at the pth stage of $SE_{1:m}$,
where $(x^i_1 x^i_2 \cdots x^i_n)$ and $(x^i_{m+1} x^i_{m+2} \cdots x^i_{n+m})$ are the addresses of the input and the
destination, respectively. Due to the fact that $D_{1:(n+m)}$ passes $RB_{1:n}SE_{1:m}$ and a SE
stage performs the shuffle operation followed by the exchange operation, $RB_{1:n}SE_{1:p}$
sends $D_{1:(n+m)}(i)$ to the output of $RB_{1:n}SE_{1:p}$ whose value equals $(x^i_{p+1} x^i_{p+2} \cdots x^i_{n+p})$.
Therefore, the permutation represented by $D_{(m+1):(n+m)}$ is implemented by
$RB_{1:n}SE_{1:m}$.
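
The address arithmetic in this argument can be checked with a short Python sketch that follows a single routing tag through successive SE stages. It assumes the usual reading of the two operations: the shuffle rotates the n-bit address one position to the left, and the exchange replaces the last address bit by the control bit (0 for the upper output, 1 for the lower output). The function name and arguments are hypothetical, and conflicts between different tags are not modeled.

```python
def se_route(position, tag_bits, n):
    """Follow one routing tag through successive shuffle-exchange stages.

    position : n-bit address at which the tag enters the first SE stage
    tag_bits : control bits, one per SE stage, consumed left to right
    n        : number of address bits (N = 2**n inputs/outputs)
    Returns the address reached after the last stage.
    (Illustrative sketch; only a single path is traced.)
    """
    mask = (1 << n) - 1
    for bit in tag_bits:
        # Shuffle: rotate the n address bits one position to the left.
        shuffled = ((position << 1) | (position >> (n - 1))) & mask
        # Exchange: the SB places the tag on its upper (0) or lower (1) output.
        position = (shuffled & ~1) | bit
    return position
```

With n = 3, a tag entering at address 101 with control bits 1, 1 leaves at address 111, which is the last three bits of the string 10111, in agreement with the statement that p stages deliver the tag to the output whose value is given by bits p+1 through n+p.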

(«-)  It  is  shown  that,  if  Dv.iH4m)  passes  RB  VjtSE  l3H,  thenD1:(n+m)  fits  FftFiw 
and  RB  iMSE  VjH  realizes  the  permutation  represented  by  D  (,*+!).(„+*•  )• 

Because,  by  hypothesis,  D  passes  RB^SE^,  the  submatrix  Dlai  of 
passes  RBlM.  So,  by  Theorem  IV.l,  the  submatrix  D\:n  fits  FfJ,.  By 
definition,  any  column  of  the  universal  frame  F\m  is  a  single  block  of  size  N.  There¬ 
fore,  any  balanced  matrix  of  order  (Nxm)  fits  F*M.  It  follows  that  D(n+iWn+m)  fits 
Fjja*  Hence,  D  i :(*+*,>  fits  Ff5»F iy*. 

The  first  part  (— ►)  of  the  proof  has  shown  that  the  permutation  represented  by 
£(*+i):(»-h»)  is  implemented  by  RBimSE  Vjm  if  D^H+m)  fits  FfoF**.-  Because  it  is 




shown  above  that  D  5ts  v  1 5*F iy*»  RB  \ -nSE  ly*  realizes  the  permutation 

corresponding  to  D(w+1);(n+wl).  □ 

Proof  of  Theorem  V.2:  Proof  is  by  induction  on  m. 

Basis Step: Let m=1. In this step it is proven that $RB_{1:n}SE_1$ is functionally and
topologically equivalent to $SE_1^{-1}RB_{1:n}$. Recall that $RB_{1:n}$ is functionally and topologically
equivalent to $BE_{1:n}$. Therefore, $RB_{1:n}SE_1$ is functionally and topologically
equivalent to $BE_{1:n}SE_1$. $BE_{2:n}$ consists of 2 copies of $BE_{2^{n-1} \times (n-1)}$ in parallel, while
$RB_{1:(n-1)}$ consists of 2 copies of $RB_{2^{n-1} \times (n-1)}$ in parallel. Because $BE_{2^{n-1} \times (n-1)}$ is functionally
and topologically equivalent to $RB_{2^{n-1} \times (n-1)}$, $BE_{2:n}$ is functionally and topologically
equivalent to $RB_{1:(n-1)}$. Therefore, $BE_{1:n}SE_1$ is functionally and topologically
equivalent to $BE_1RB_{1:(n-1)}SE_1$. Because the last stage of $RB_{1:n}$ is identical to
the SE stage, $RB_{1:(n-1)}SE_1$ is identical to $RB_{1:n}$. Therefore, $BE_1RB_{1:(n-1)}SE_1$ is functionally
and topologically equivalent to $BE_1RB_{1:n}$. Due to the fact that $BE_1$ is identical
to the inverse SE stage, $BE_1RB_{1:n}$ is functionally and topologically equivalent to
$SE_1^{-1}RB_{1:n}$. It follows that $RB_{1:n}SE_1$ is functionally and topologically equivalent to
$SE_1^{-1}RB_{1:n}$.

Induction Step: Assume that, for $m \ge 2$, the theorem holds for m-1, and show
that it also holds for m.

Because $RB_{1:n}$ is functionally and topologically equivalent to $BE_{1:n}$, $RB_{1:n}SE_{1:m}$
is functionally and topologically equivalent to $BE_{1:n}SE_{1:m}$. As it is explained in the
Basis Step above, $BE_{2:n}$ is functionally and topologically equivalent to $RB_{1:(n-1)}$.
Therefore, $RB_{1:n}SE_{1:m}$ is functionally and topologically equivalent to
$BE_1RB_{1:(n-1)}SE_{1:m}$. Because the last stage of $RB_{1:n}$ is identical to the SE stage,
$BE_1RB_{1:(n-1)}SE_{1:m}$ is identical to $BE_1RB_{1:n}SE_{1:(m-1)}$. By the induction hypothesis,
$RB_{1:n}SE_{1:(m-1)}$ is functionally and topologically equivalent to $SE^{-1}_{1:(m-1)}RB_{1:n}$. So,
$BE_1RB_{1:n}SE_{1:(m-1)}$ is functionally and topologically equivalent to
$BE_1SE^{-1}_{1:(m-1)}RB_{1:n}$. Because $BE_1$ is identical to the inverse SE stage,
$BE_1SE^{-1}_{1:(m-1)}RB_{1:n}$ is functionally and topologically equivalent to $SE^{-1}_{1:m}RB_{1:n}$. Thus,
the theorem holds. □

Proof  of  Theorem  VTL1:  Case  1:  Let  r~n-\.  When  r=n-l,  T  consists  of  only 
a  reamngeable  network  P(2*\)  and  is  identical  to  the  universal  frame  By 
definition,  any  balanced  matrix  of  order  tfxn  fits  £,vw.  and  P(2*\)  passes  any  balanced 
matrix  of  order  /Vxn.  Therefore,  a  Dsm  fits  £,v^  if  and  only  if  passes  T. 

Case  2:  Let  r=0.  When  r=0,  and  T  are  identical  to  and  RBnm, 
respectively.  Because  Theorem  IV.  1  shows  that  a  DNmi  fits  Fjft*  if  and  only  if  Dsm 
passes  RBnwh  Theorem  VTL1  holds  for  this  case. 




Case 3: Let $1 \le r \le n-2$. Assume that $D_{N \times n}(i)$, $0 \le i \le N-1$, is used as the routing
tag for the ith input of T.


(— >)  It  is  shown  that,  if  Av**  Sts  F#£,  then  Dy**  passes  T. 


In what follows, it is first shown that the submatrix $D_{1:(r+1)}$ of a $D_{N \times n}$ passes S.
By the definition of rearrangeability, any of the $2^{n-r-1}$ rearrangeable networks
$P(2^{r+1}!)$ of S can pass any balanced matrix of order $2^{r+1} \times (r+1)$. Label these rearrangeable
networks in ascending order starting with 0. Let $P^{\alpha}(2^{r+1}!)$ denote the $\alpha$th
rearrangeable network $P(2^{r+1}!)$ of S, where $0 \le \alpha \le 2^{n-r-1}-1$.

Consider  the  universal  frame  Fy+l^+i).  Any  column  of  F^^r+i)  is  just  a 
single  block  of  length  2r*1 .  Because  a  column  of  requires  a  column  vector 

of  length  2r+l  to  have  only  2'  zeros  and  2'  ones,  any  column  of  a  balanced  matrix  of 
order  2'+1x(/'+l)  fits  it.  It  follows  that  any  balanced  matrix  of  order  2'+lx(r+l)  fits 
^r2r+l»(r+i)-  Therefore,  Pa(2'*1!)  corresponds  to  the  universal  frame  F^-t-l*^). 
The  subframe  Ff^r+i)  can  be  considered  as  being  a  pile  of  2n~r~1  F^+l^r+1)S.  Label 
these  universal  frames  in  ascending  order  starting  with  0. 
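
The claim that every balanced matrix of this order fits the universal frame can be checked exhaustively for a small case. The Python sketch below rests on two assumptions drawn from the surrounding text rather than from a formal definition: a balanced matrix of order 2^m x m is taken to be one whose rows are all the distinct m-bit words in some order, and a column fits a single block when it holds as many zeros as ones. The helper names are hypothetical.

```python
from itertools import permutations


def to_rows(values, width):
    """Binary rows (most significant bit first) for a list of integers."""
    return [[(v >> (width - 1 - c)) & 1 for c in range(width)] for v in values]


def fits_universal_frame(rows):
    """True if every column holds as many zeros as ones.

    This models a frame in which each column is a single block spanning
    all rows (an assumption made for this sketch).
    """
    half = len(rows) // 2
    return all(sum(col) == half for col in zip(*rows))


# Exhaustive check for m = 2: all 4! orderings of the words 00, 01, 10, 11.
all_fit = all(fits_universal_frame(to_rows(p, 2)) for p in permutations(range(4)))
```

A matrix with repeated rows, such as one with rows 00, 01, 00, 11, fails the check because its first column holds three zeros and one one.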

Partition the balanced submatrix $D_{1:(r+1)}$ of $D_{N \times n}$ into $2^{n-r-1}$ balanced submatrices
of order $2^{r+1} \times (r+1)$ such that the set of the row indices of the $\alpha$th submatrix consists
of the numbers $(\alpha \times 2^{r+1})$ to $[(\alpha+1) \times 2^{r+1}-1]$ inclusive. Label these submatrices of
order $2^{r+1} \times (r+1)$ in ascending order starting with 0. Denote the $\alpha$th submatrix of
$D_{1:(r+1)}$ by $D^{\alpha}_{1:(r+1)}$.


By  hypothesis,  fits  Ff%.  This  implies  that  D  i;(r+1)  fits  Ff^'+1).  Therefore, 
D?:<r+i)  fi®*  the  ath  F^+l^r+i).  Because  Pa(2r*1 !)  is  a  rearrangeable  network,  it 
passes $D^{\alpha}_{1:(r+1)}$, that is, $P^{\alpha}(2^{r+1}!)$ sends its kth input to the output whose value equals
$D^{\alpha}_{1:(r+1)}(k)$, where $0 \le k \le 2^{r+1}-1$. This implies that the network S sends its ith input to
its jth output, where j equals the sum of $\lfloor i/2^{r+1} \rfloor \times 2^{r+1}$ and the value of the leftmost
(r+1) bits of $D_{1:n}(i)$. Hence, $D_{1:(r+1)}$ passes S.


Theorem  IV.  1  shows  that  a  balanced  matrix  C  1;*  that  fits  Ffo  passes  RB 
Theorem IV.1 also shows that $RB_{1:(r+1)}$ sends its ith input to its hth output, where h is
equal to the sum of $\lfloor i/2^{r+1} \rfloor \times 2^{r+1}$ and the value of $C_{1:(r+1)}(i)$. Thus, the networks
$RB_{1:(r+1)}$ and S send their ith inputs to their jth outputs, where j equals the sum of
$\lfloor i/2^{r+1} \rfloor \times 2^{r+1}$ and the value of the ith row of the matrix passing the corresponding


network  that  is  either  RB  or  5.  By  definition,  Ff^^  is  the  same  as  F(?+i)». 
This  implies  that  F^^  is  also  the  same  as  F“+ j)ai*  K  follows  from  this  paragraph 
that  the  argument  given  in  the  (-*)  part  of  the  proof  of  Theorem  IV.  1  applies  to 
RB  (,+2**  °f  T  FfjhZiya «•  (If  hi  Theorem  TV.  1  RB  i:(r+i)  and  Ff?(r+1)  are  replaced  by 




S  and  FJ5£+i),  respectively,  Theorem  IV.l  becomes  identical  to  Theorem  vn.l). 
Therefore,  Djv**  passes  7. 

(<-)  It  is  shown  that,  if  DNm  passes  7,  then  DNm  fits  Fff&. 

First,  consider  the  submatrix  D  i:(r+i)  of  Dj*.  By  hypothesis,  Av«,  passes  7. 
This  implies  that  D  i:(r+i)  passes  S  because  S  consists  of  2H~r~l  copies  of  a  rearrange- 
able  network  P( 2#,+1!)  in  parallel  and  f>?:(r+i)  passes  />a(2'+1 !).  Because  any  bal¬ 
anced  matrix  of  order  2'+1x(r+l)  fits  a  universal  frame  F^+^r+i).  £?:(/-+n  also  fits 
^2r+l*(/>+i)'  Recall  that  Ff^+i)  can  be  considered  as  a  pile  of  2n_,_1  copies  of 
Fif+l*(r+ !)•  Therefore,  i)  fitsFf^+i). 

Now,  it  is  shown  by  induction  on  0, 1  £  0  £  /i-r-1,  that  D  i;(r+i+p)  fits 
assuming  that  Dy**  passes  7.  (The  proof  presented  below  is  analogous  to  pan  (<— )  of 
the  proof  of  Theorem  IV.l). 

Basis  step:  Let  0=1.  Few  0  £  /  £  2M~r~1-\.,  let  0h(r+i)  (*)  and  denote 

the  2/th  and  (2/+l)th  /?f:(/.+i)S,  respectively.  Similarly,  let  P*  (2r+l !)  and /»“*( 2r+l !) 
denote  the  2/th  and  (2/+l)th  rearrange  able  networks  of  5,  respectively.  Because  the 
stage  RB(,+2)  consists  of  a  pile  erf  2H~r~2  copies  of  the  SE  stage  with  2r+1 
inputs/outputs,  the  subnetwork  that  consists  of  the  pile  of  P**1  (2'+1 !)  and  Pai(2r*1 !) 
is  followed  by  the  SE  stage  with  2r+2  inputs/outputs.  Because  Df;(r+1)  passes 
Pa( 2f*1 !),  Pa(2r*1 !)  sends  its  kth  input  to  its  mth  output,  where  m  equals  the  contents 
of  (k).  Hence,  the  rows  that  are  sent  to  the  kth  outputs  of  />0t,(2r+l!)  and 

/»“*( 2#,+l!)  enter  the  kth  SB  of  the  succeeding  SE  stage  with  2r+2  inputs/outputs.  By 
hypothesis,  passes  7.  This  implies  that  D  \:(r+i)  passes  the  network  consisting  of 
5  followed  by  the  stage  RBr+ 2  without  having  any  conflict  in  the  SBs.  Therefore,  the 
(r+2)th  entries  of  the  rows  that  are  sent  to  the  kth  outputs  of  P**1  (2r+l !)  and 
P°*( 2r+1!)  constitute  the  set  {0,1}.  Notice  that  these  rows  have  the  same  first  k-1 
entries.  Therefore,  the  (r+2)th  entries  of  any  two  identical  rows  of  the  submatrix 

D%(r+\) 

D\:(r+ 1) 

constimte  the  set  {0, 1 }.  Therefore,  by  definition  of  fit,  D  l;(r+2)  fits  Fh{r+2)- 

Induction  step:  Assume  that,  for  2  £  0  £  n-r-l,  D  i:<r+&)  fits  F Then, 
show  that  D also  fits 

Let  2  £  0  £  n-r-l.  By  the  induction  hypothesis,  D  i:(r+0)  fits  Fx#+$).  lx  is  also 
known  that  D#*,  passes  7.  So,  as  D  \:(r+$)  passes  the  network  consisting  erf  5  followed 
by  RB(r+mr+to>  0  i:<r+i+0)  passes  the  network  consisting  of  5  followed  by 

RB<r+ 2«r+l+«* 




Partition  the  matrix  D i.(r+0)  into  2*~r~^  submatrices  2n~r~\ 

which  are  labeled  in  ascending  order  starting  with  0.  Let  0  £  u  £  2',-r_ft_1-l,  y1=2u 
and  Y2=2u+1.  The  stage  RB  (r+i+$)  consists  of  copies  of  the  SE  stage  with 

2'+1+0  inpats/outputs.  The  rows  that  are  sent  to  the  rth,  0  £  s  5  2/’+^-l,  outputs  of  the 
subnetworks  that  pass  and  D^^r+P)  enter  the  sth  SB  of  the  SE  stage  with 

2'+1+^  inputs/outputs.  Because  no  conflict  occurs  in  the  rth  SB  by  hypothesis,  the 
(r+l+0)th  entries  of  the  rows  entering  the  sth  SB  must  constitute  the  set  {0, 1 }.  There¬ 
fore,  by  definition  of  fit,  D  l^+i+fl)  fits  +!+&)•  □ 

Proof  of  Corollary  VTL1:  Consider  the  network  T,  that  is  defined  in  Theorem 
VTL1,  and  its  components  5  and  *.  Recall  that  5  consists  of  2n~r~l  copies  of  a 
reanangeable  network  P(2r+l\)  in  par&ileL  If  the  Benes  network^Sy+ixO+i)  substi¬ 
tutes  for  each  reanangeable  network  P(2r+l !)  of  5,  then  5  consists  of  2H~r~l  copies  of 
the  reanangeable  network  in  parallel  and  hence  5  is  made  identical  to 

the  subnetwork  2*5  of  BS^^n-i)-  Because  BS  i;(2*-n  can  be  considered  as 

being  composed  of  BE i:(*-i)  followed  by  RB  i:H,  BS (n-r):(n+r)  is  the  same  as 
2?£  So,  T  is  functionally  equivalent  to  the  network  consisting  of 

2*5 (*_,).(*+,)  followed  by  RB^+i^n-  Because  the  network  that  consists  of  2*5(„_,):(n+r) 
followed  by  RB(r +2y*  is  identical  to  2*5  (*_,);( 2i»-o  and  the  fact  that  a  balanced  matrix 
Dy*,  fits  if  and  only  if  Df/wt  passes  T  (Theorem  VIL1),  Dv*n  fits  if  and  only 
if  passes  B5(m^.):(2ii-i).  Therefore,  the  corollary  holds.  □ 


REFERENCES 

[1] D. H. Lawrie, "Access and alignment of data in an array processor," IEEE
Trans. on Comput., vol. c-24, pp. 1145-1155, Dec. 1975.

[2] N. Linial and M. Tarsi, "Interpolation between bases and the Shuffle-Exchange
network," Europ. J. Combinatorics, vol. 10, pp. 29-39, 1989.

[3] J. A. Bondy and U. S. R. Murty, Graph Theory with Applications, North-Holland,
New York, 1979.

[4] L. Lovasz and M. D. Plummer, Matching Theory, Annals of Discrete Math., 29,
North-Holland, 1986.

[5] H. S. Stone, "Parallel processing with the perfect shuffle," IEEE Trans. on Comput.,
vol. c-20, pp. 153-161, Feb. 1971.

[6] D. P. Agrawal, "Graph theoretical analysis and design of multistage interconnection
networks," IEEE Trans. Comput., vol. c-32, pp. 637-648, July 1983.




[7] V. E. Benes, Mathematical Theory of Connecting Networks and Telephone
Traffic, Academic Press, New York, 1965.

[8] A. Waksman, "A permutation network," Journal of the ACM, vol. 15, no. 1, pp.
159-163, Jan. 1968.

[9] C. P. Kruskal and M. Snir, "A unified theory of interconnection network structure,"
Theoretical Computer Science, vol. 48, pp. 75-94, 1986.

[10] K. Y. Lee, "On the rearrangeability of 2(log N)-1 stage permutation networks,"
IEEE Trans. Comput., vol. c-34, pp. 412-425, May 1985.

[11] C. Wu and T. Feng, "On a class of multistage interconnection networks," IEEE
Trans. on Comput., vol. c-29, pp. 696-702, 1980.

[12] T. Etzion and A. Lempel, "An efficient algorithm for generating linear transformations
in a shuffle-exchange network," SIAM J. Comput., vol. 15, no. 1, pp.
216-221, Feb. 1986.

[13] C. Clos, "A study of non-blocking switching networks," Bell System Tech. J.,
vol. 32, pp. 406-424, 1953.

[14] T.Y. Feng, "Data manipulating functions in parallel processors and their implementations,"
IEEE Trans. on Comput., vol. c-23, pp. 309-318, March 1974.

[15] K.E. Batcher, "The flip network in STARAN," in Proc. 1976 Int. Conf. Parallel
Processing, pp. 65-71.

[16] L.R. Goke and G.J. Lipovski, "Banyan networks for partitioning multiprocessor
systems," in Proc. 1st Symp. Comput. Architecture, Dec. 1973, pp. 175-189.

[17] M.C. Pease III, "The indirect binary n-cube multiprocessor array," IEEE Trans.
on Comput., vol. c-26, pp. 458-473, May 1977.

[18] A.Y. Oruc and M.Y. Oruc, "Equivalence relations among interconnection networks,"
J. of Parallel and Distributed Computing, vol. 2, pp. 30-49, 1985.

[19] H. Cam and J.A.B. Fortes, "Permutation routing algorithms of frequently used
networks," in preparation.

[20] J.J. Rotman, The theory of groups: An introduction, Allyn & Bacon, Rockleigh,
N.J., 1965.

[21] H.J. Siegel, Interconnection Networks for Large-Scale Parallel Processing, Lexington
Books, 1986.

[22] A. Youssef and B. Arden, "A New Approach to Fast Control of r^2 x r^2 3-stage
Benes Networks of r x r Crossbar Switches," 1990 Comput. Archit. Conf., pp. 50-59.

[23] J. Lenfant, "Parallel permutations of data: A Benes network control algorithm for
frequently used permutations," IEEE Trans. on Comput., vol. c-27, pp. 637-647,
July 1978.




[24] D. Nassimi and S. Sahni, "A self-routing Benes network and parallel permutation
algorithms," IEEE Trans. on Comput., vol. c-30, pp. 332-340, May 1981.

[25] C.S. Raghavendra and R.V. Boppana, "On Self-Routing in Benes and Shuffle-Exchange
Networks," IEEE Trans. on Comput., vol. 40, no. 9, pp. 1057-1064,
Sept. 1991.

[26] J.C. Bermond, J.M. Fourneau and A. Jean-Marie, "A graph theoretical approach
to equivalence of multistage interconnection networks," Discrete Applied
Mathematics, vol. 22, pp. 201-214, 1989.

[27] D.M. Dias, Packet Communication in Delta and Related Networks, Ph.D. Thesis,
Rice Univ., 1981.

[28] H. Cam, Design and permutation routing algorithms of rearrangeable networks,
Ph.D. Thesis, Purdue Univ., May 1992.

[29] T.H. Cormen, C.E. Leiserson and R.L. Rivest, Introduction to Algorithms, MIT
Press, 1991.


REFERENCE  NO.  20 


Cam, H. and Fortes, J. A. B., "New Routing Algorithms for Benes and Reduced
$\Omega_N\Omega_N^{-1}$ Networks," Submitted for publication in IEEE Transactions on Computers.

Note  -  this  paper  presents  algorithms  for  the  routing  of  messages  in  two  important 
rearrangeable  networks.  They  can  be  used  to  set  processor  array  systems  into 
configurations  that  exclude  faulty  components. 


NEW ROUTING ALGORITHMS FOR BENES
AND REDUCED $\Omega_N\Omega_N^{-1}$ NETWORKS


Hasan Çam and José A. B. Fortes

School  of  Electrical  Engineering 
Purdue  University 
West  Lafayette,  IN  47907-1285 

ABSTRACT 

In this paper, new permutation routing algorithms are presented for the
Benes and the reduced $\Omega_N\Omega_N^{-1}$ networks. These algorithms first compute a
routing tag of 2n-1 bits for every input, then set the switches one stage at a
time, the setting of a switch in stage j being determined by the jth bit of the
routing tags at its inputs. The last n bits of any routing tag are the destination
address bits. The serial and parallel time complexities of these algorithms are
$O(N \log_2 N)$ and $O(\log_2^2 N)$, respectively. They are the same as or better than
those of previously proposed routing algorithms. In comparison with other
approaches of similar asymptotic complexity, the algorithms in this paper are
simpler and easier to understand and program.

Index Terms - Interconnection network, permutations, routing algorithm,
frames, balanced matrices, rearrangeability.


This research was supported in part by the Office of Naval Research under contract No. 00014-00-J-1483 and in
part by the Innovative Science and Technology Office of the Strategic Defense Initiative Organization and was
administered through the Office of Naval Research under contract No. 00014-88-k-0723.


I.  Introduction 

An interconnection network (IN) is used for exchanging information
among the computation nodes and memory modules in a parallel computer. An
IN is often required to realize any permutation between processors and
processors/memory modules because many parallel algorithms are designed on
the basis of one-to-one type data transfers. If an IN realizes any permutation in
one pass, then it is called rearrangeable. A multistage IN of $2\times2$ switching
boxes (SBs) must have at least $2n-1$ stages to be rearrangeable [7]. A
disadvantage of this type of network is that its routing algorithms have
costly implementations. This paper presents new simple routing algorithms to
realize permutations in these networks. These algorithms, which make use of
balanced matrices [2], frames [18] and perfect matching graphs [3,4], take
$O(N\log N)$ time on a uniprocessor computer and $O(\log^2 N)$ on a parallel
computer of N processors.1

Routing algorithms for the Benes network, a well-known rearrangeable
network [6], are described in [13,14,15,16]. The control algorithm presented in [16],
called the looping algorithm, determines the settings of the SBs recursively, starting
with the outermost stages and heading towards the center stage. This algorithm
takes $O(N\log N)$ time on a uniprocessor. Nassimi and Sahni [15] proposed a
parallel algorithm which determines the switch settings in $O(\log^2 N)$ time when
a fully interconnected network of N processors is used. Lee [13] introduced a
non-recursive algorithm which sets SBs stage by stage from left to right. The
advantages of a non-recursive algorithm include a reduction of information
transfer among the chips in VLSI Benes network implementations and the
pipelining of consecutive permutations so that the routing time is reduced by a
factor of $O(\log N)$. Taking into consideration the number of comparisons needed
to pair switches such that every consecutive pair has a common residue class
number (see [13] and Section III of this paper), the time complexity of the
algorithm is $O(N^2\log N)$ and $O(N^2)$ on a uniprocessor and a parallel computer,
respectively. The routing algorithm proposed in this paper for the Benes network is
also non-recursive, but it takes $O(N\log N)$ time on a uniprocessor computer and
$O(\log^2 N)$ on a parallel computer. In addition, it is easy to comprehend and
simple to program.

Lee [9] has proven the rearrangeability of the network called the reduced
$\Omega_N\Omega_N^{-1}$, which is composed of the (n-1)-stage SE network followed by
the n-stage inverse SE network. The routing algorithm proposed in [9] for the reduced
$\Omega_N\Omega_N^{-1}$ takes the same time as the one proposed in [13] for the Benes network


1 All logarithms are in base 2 unless stated otherwise.




discussed in the above paragraph. A new proof for the rearrangeability of this
network is also provided in this paper. This proof directly leads to a routing
algorithm which is very similar to the routing algorithm of the Benes network. This
proposed algorithm also takes $O(N\log N)$ time on a uniprocessor computer and
$O(\log^2 N)$ on a parallel computer.

The remainder of this paper is organized as follows. Section II is dedicated
to the basic terminology and definitions used throughout the paper. Section III
presents basic serial and parallel algorithms used in the routing algorithms of
the Benes and reduced $\Omega_N\Omega_N^{-1}$ networks. Section IV is dedicated to the
routing algorithm of the Benes network. Section V discusses the routing algorithm
of the reduced $\Omega_N\Omega_N^{-1}$. Concluding remarks are made in Section VII. The proofs
omitted in the paper are presented in the Appendix (Section VI).


II. BASIC DEFINITIONS

Throughout this paper, matrices are denoted by single capital letters and
columns of a matrix are represented by the lower case of the capital letter
denoting that matrix. Matrix A having N rows and k columns is denoted by
$A_{N\times k}$. Given a matrix, e.g. $A_{N\times k}$, the jth column is denoted by $a_j$, $1 \le j \le k$.
To be able to refer to a set of specific columns of a matrix, the notation $A_{x:y}$ is
used to denote the submatrix that contains those columns of A whose indices
are $x, x+1, \ldots, y$, where $1 \le x \le y$; if x happens to be greater than y, then
$A_{x:y}$ refers to a nil matrix, unless stated otherwise. If $x=y$, then $A_{x:y}$ refers to a
single column $a_x$. Unless specifically stated, the number of the rows of a matrix
$A_{x:y}$ is assumed to be equal to N. $A_{N\times n}(i)$ refers to the ith row of the matrix
$A_{N\times n}$, where $0 \le i \le N-1$. A column vector of N entries of which half are 0's
and the other half are 1's is called a column permutation. Unless otherwise
stated, any column of any matrix in this paper is a column permutation. The
binary representation of an integer b, $0 \le b \le N-1$, is $(b_1 b_2 \cdots b_n)$ such
that $b = b_1 2^{n-1} + b_2 2^{n-2} + \cdots + b_n 2^0$.

A permutation on a set X is a bijection of X onto itself. A permutation f
permutes the ordered list $0, 1, \ldots, N-1$ into $f(0), f(1), \ldots, f(N-1)$. A
cyclic notation [19,20] can be used to represent a permutation as the product of
cycles, where a cycle $(c_0\ c_1\ c_2 \cdots c_{k-1}\ c_k)$ means $f(c_0) = c_1$, $f(c_1) = c_2$,
$\ldots$, $f(c_{k-1}) = c_k$, and $f(c_k) = c_0$. The composition of several permutations
$f_1 f_2 \cdots f_k$ is evaluated from left to right, i.e., it maps i into
$f_k(\cdots(f_2(f_1(i)))\cdots)$.
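The cycle notation and the left-to-right composition rule can be illustrated with a short Python sketch. The helper names `from_cycles` and `compose` are hypothetical and introduced only for this illustration; a permutation f is stored as a list whose entry at index i is f(i).

```python
def from_cycles(cycles, size):
    """Build f as a list from a product of disjoint cycles (fixed points may be omitted)."""
    f = list(range(size))
    for cyc in cycles:
        cyc = list(cyc)
        # (c0 c1 ... ck) means f(c0)=c1, ..., f(ck)=c0
        for a, b in zip(cyc, cyc[1:] + cyc[:1]):
            f[a] = b
    return f


def compose(*perms):
    """Compose f1 f2 ... fk from left to right: i is mapped to fk(...f2(f1(i))...)."""
    result = list(range(len(perms[0])))
    for f in perms:
        result = [f[x] for x in result]
    return result


if __name__ == "__main__":
    gamma = from_cycles([(1, 3, 6, 4), (2, 5)], 8)
    print(gamma)
    f = from_cycles([(0, 1)], 3)
    g = from_cycles([(1, 2)], 3)
    print(compose(f, g), compose(g, f))
```

Because the composition is applied left to right, `compose(f, g)` and `compose(g, f)` generally differ, which is why the order convention matters in the sequel.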

Definition II.1. (Permutation matrix, identity permutation
matrix, reverse permutation matrix): A permutation h can be




represented by an $N\times n$ binary matrix called permutation matrix, H, such that its
ith row, $H_{N\times n}(i)$, is the binary representation of the integer h(i). The identity
permutation matrix denoted by $I_{N\times n}$ is the matrix whose ith row is the binary
representation of i (this is called “standard matrix” in [12]). The reverse
permutation matrix, denoted $R_{N\times n}$, is the matrix whose jth column is the
(n+1-j)th column of $I_{N\times n}$.
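As a minimal sketch of Definition II.1, the following Python functions build the three matrices as lists of bit rows. The function names are hypothetical; bit $b_1$ is taken to be the most significant bit, in agreement with the binary representation defined above.

```python
def to_bits(value, n):
    """Return (b1, ..., bn) with b1 the most significant bit of value."""
    return [(value >> (n - 1 - j)) & 1 for j in range(n)]


def permutation_matrix(h, n):
    """Row i holds the binary representation of h(i)."""
    return [to_bits(h[i], n) for i in range(len(h))]


def identity_matrix(n):
    """Row i holds the binary representation of i."""
    return permutation_matrix(list(range(2 ** n)), n)


def reverse_matrix(n):
    """Column j is column (n+1-j) of the identity matrix, i.e. each row is reversed."""
    return [row[::-1] for row in identity_matrix(n)]


if __name__ == "__main__":
    for row in reverse_matrix(3):
        print(row)
```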

In the terminology used in this paper, a k-stage IN consists of k columns of
switching boxes (SBs), each followed and preceded by links which form
interconnection patterns (IPs) as shown in Figure II.1. The IPs formed by the input
and output links are denoted by $IP_{in}$ and $IP_{out}$, respectively. Thus, an IN
contains (k+1) interconnection patterns labeled $IP_{in}, IP_1, IP_2, \ldots, IP_{k-1}, IP_{out}$.
A column of an IN contains N/2 ($2\times2$) SBs, each of which can be set either
straight or cross. Figures II.2, II.3, and II.4 show several networks considered in
this paper for N=16, namely, reverse baseline, baseline, Benes, the 4-stage
shuffle-exchange (SE), and the 4-stage inverse SE. If some networks are placed
in parallel to form a new IN, then the IN is said to be a “pile of networks”.
Unless otherwise stated, any IN is assumed to have N inputs/outputs and its
stages are labeled from left to right starting with 1. Network stages are defined
below and illustrated in the figures.

Definition II.2. (Stages of reverse baseline, baseline, Benes, SE,
and inverse SE networks): With one exception, a stage in the reverse
baseline and SE networks consists of a connection pattern and the following column
of SBs. The exception is the rightmost stage (i.e., the output stage) which
consists of the last column of SBs and both the preceding and succeeding
connection patterns. Stages are labeled from left to right in ascending order starting
with 1. In the baseline network the kth stage corresponds to the (n-k+1)th
stage of the reverse baseline network. (Notice that both the reverse baseline
and the baseline can have at most n stages, by definition.) In the inverse SE
network with m stages, its kth stage corresponds to the (m-k+1)th stage of the
m-stage SE network. In this paper, the Benes network is considered as being
composed of the first n-1 stages of the n-stage baseline followed by the n-stage
reverse baseline. (It could also be considered as being composed of the n-stage
baseline followed by the last n-1 stages of the n-stage reverse baseline.)
Therefore, the stages of the Benes network are labeled according to the labeling
rules of the baseline and the reverse baseline.




Figure II.1. An IN with ($2\times2$) SBs and interconnection patterns shown as large boxes.


Figure II.2. (a) The 4-stage reverse baseline network with 16 inputs/outputs. (b) The
4-stage baseline network with 16 inputs/outputs.


Figure II.3. Benes network with 16 inputs/outputs.


Figure II.4. (a) The omega network (i.e., the 4-stage SE) with 16 inputs/outputs. (b) The
inverse omega network (i.e., the 4-stage inverse SE) with 16 inputs/outputs.


An IN with N inputs/outputs and k stages is denoted by both $IN_{N\times k}$ and
$IN_{1:k}$, where $k \ge 1$. The subnetwork that consists of the stages x through y of
$IN_{1:k}$ is denoted by $IN_{x:y}$, where $1 \le x \le y \le k$. If $x>y$, then $IN_{x:y}$ refers to a nil
network, unless specified otherwise. $IN_j$, $1 \le j \le k$, refers to the jth stage of
$IN_{1:k}$. The notation used for networks is different from that used for matrices
because matrices are always denoted by single letters.

In this paper the following convention is adopted to denote an IN: if the
name of an IN has more than one word, then it is denoted by the upper case
form of the first letters of those words; otherwise, it is denoted by the upper
case form of its first and last letters. Also, if XX denotes an IN, then the inverse
XX network may be denoted by $XX^{-1}$. The following definition applies this
convention to the baseline, reverse baseline, shuffle-exchange, inverse
shuffle-exchange and Benes networks of interest in this paper.

Definition II.3. (BE, RB, SE, $SE^{-1}$, BS, composite IN, $IP^S$): The
symbols BE, RB, SE, $SE^{-1}$ and BS in this paper refer to the networks
baseline, reverse baseline, shuffle-exchange, inverse shuffle-exchange and Benes,
respectively. If an IN is a cascade of different INs, then it is called a composite
IN and is denoted by the concatenation of symbols that represent the INs in the
order they are cascaded. $IP^S$ denotes the shuffle interconnection pattern of an SE
stage of N inputs/outputs.

As an example of a composite network, the notation $RB_{1:n}SE_{1:m}$, $m \ge 1$,
denotes the network consisting of $RB_{1:n}$ followed by $SE_{1:m}$. The reduced
$\Omega_N\Omega_N^{-1}$ can also be denoted by both $SE_{1:(n-1)}IP^S SE^{-1}_{1:n}$ and
$SE_{N\times(n-1)}IP^S SE^{-1}_{N\times n}$.

Linial and Tarsi [2] introduced the concept of balanced matrices to
establish a relation between SE networks and their realizable permutations. The
following definition is equivalent to the one given in [2].

Definition II.4. (Balanced matrix): Let $N=2^n$ and call a 0-1 matrix
$A_{N\times k}$ balanced if either one of the following conditions is satisfied:

1. For $k \le n$, it consists of any k columns of the binary representation of a
permutation on the set $\{0,1,\ldots,N-1\}$.

2. For $k>n$, every n consecutive columns form the binary representation of
a permutation on the set $\{0,1,\ldots,N-1\}$.
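The two conditions of Definition II.4 can be checked mechanically. The Python sketch below is one possible test, not part of the original algorithms; the name `is_balanced` is hypothetical. For $k \le n$ it relies on the observation that k columns can be completed to a permutation exactly when each of the $2^k$ bit patterns occurs $2^{n-k}$ times among the rows.

```python
from collections import Counter


def is_balanced(matrix, n):
    """Check Definition II.4 for an N x k 0-1 matrix given as a list of rows, N = 2**n."""
    if len(matrix) != 2 ** n:
        return False
    k = len(matrix[0])
    if k <= n:
        # Condition 1: every k-bit pattern must appear exactly 2**(n-k) times.
        counts = Counter(tuple(row) for row in matrix)
        return len(counts) == 2 ** k and all(c == 2 ** (n - k) for c in counts.values())
    # Condition 2: every window of n consecutive columns must be a permutation.
    return all(
        is_balanced([row[s:s + n] for row in matrix], n)
        for s in range(k - n + 1)
    )


if __name__ == "__main__":
    wide = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]  # N = 4, n = 2, k = 3
    print(is_balanced(wide, 2))
```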


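As an example, two balanced matrices E and H are shown below.

$$
E = [e_1\ e_2\ e_3] =
\begin{bmatrix}
1 & 1 & 1 \\
1 & 0 & 0 \\
0 & 0 & 0 \\
0 & 1 & 0 \\
1 & 1 & 0 \\
0 & 0 & 1 \\
1 & 0 & 1 \\
0 & 1 & 1
\end{bmatrix},
\qquad
H = [h_1\ h_2] =
\begin{bmatrix}
1 & 0 \\
1 & 0 \\
1 & 1 \\
1 & 1 \\
0 & 1 \\
0 & 1 \\
0 & 0 \\
0 & 0
\end{bmatrix}
$$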

Definition II.5. (Pass, realize): A balanced matrix $A_{N\times k}$ (respectively,
an IN) is said to pass a k-stage IN (respectively, a matrix $A_{N\times k}$) if no conflict
occurs in the SBs of the IN when $A_{N\times k}(i)$ is used as the routing tag for the ith
input of the IN. A network IN realizes a permutation represented by $B_{N\times n}$ if
there is a network switch setting such that input i is sent to output B(i) for all
$i = 0,1,\ldots,N-1$.

According to the last definition, in this paper, the phrases “an IN passes a
balanced matrix” and “a balanced matrix passes an IN” are used interchangeably.
It is also assumed that only "one pass" is allowed through a network to realize a
permutation. Therefore, the phrase "one pass" is omitted in the sequel. To
emphasize the distinction between the meaning of the terms “pass” and
“realize” as used in this paper, it is important to notice that matrix $A_{N\times k}$ in
Definition II.5 does not necessarily correspond to the permutation realized by
the network IN. Indeed, the ith row of $A_{N\times k}$ is the routing tag for input i and it
is only when it equals the destination of input i that $A_{N\times k}$ is the permutation
realized by IN; these cases will become clear in the remainder of the paper.


In the rest of this section, the concept of frames is introduced to
characterize the permutations realized by a network. Different frames are derived from
this concept and their graphical representations are presented. In addition,
some related fundamental concepts used in the proofs of this paper are
introduced. A more extended discussion of these concepts appears in [17].

In order to understand the concept of frame, the following definition is first
introduced (a k-tuple V with the elements $v_1, v_2, \ldots, v_k$, denoted by
$V = <v_1, v_2, \ldots, v_k>$, refers to an ordered collection of k elements).

Definition II.6. (Partition $P_i$, block, standard partition $P_i^*$, $P^*$):
Let $X=\{0, 1, \ldots, N-1\}$, $N=2^n$ and $i = 1,2,\ldots,n$. A partition $P_i$ of X is a
tuple of $2^{n-i}$ disjoint ordered subsets of X, called blocks, each of which is a
tuple with $2^i$ distinct elements. The partition $P_i^* = <\ <h, h+1, \ldots, h+2^i-1>$
such that $h \bmod 2^i = 0$ and $h = 0,1,\ldots,N-1\ >$ is a standard partition of X.
The n-tuple $<P_i^*,\ i = 1,2,\ldots,n>$ is denoted by $P^*$.

Example II.1. Let N=8. The following are the standard partitions:

$P_1^*$ = < <0,1>, <2,3>, <4,5>, <6,7> >, $P_2^*$ = < <0,1,2,3>, <4,5,6,7> >
and $P_3^*$ = <0,1,2,3,4,5,6,7>. Also, $P^* = <P_1^*, P_2^*, P_3^*>$.

The notion of frame is defined next and an example (Example II.2) is given
after the definition. Several frames are graphically represented in Figure II.5.
Notice that the frames are characterized by the labeling of columns, the
labeling of rows, and the number of blocks in columns. Therefore, the definition of
frame is done in terms of two mappings (the column and row labeling) and a
tuple of partitions (one for each column). The column labels determine the
number and size of the blocks in each partition and the row labeling determines
the elements in each block and their order. As precisely stated in the definition,
the column with label $\beta(i)$ corresponds to a partition with $2^{n-\beta(i)}$ blocks with $2^{\beta(i)}$
elements each and the mth element within the jth block corresponds to the
label $\gamma(r)$ of row $r = 2^{\beta(i)}(j-1)+m-1$. After Example II.2, a convenient
graphical representation for frames is introduced and its use is illustrated in
Example II.3 for the frames described in Example II.2.

Definition II.7. (Frame): Let $1 \le k \le n$ and $1 \le i \le k$. A frame
$F_{N\times k}$, $1 \le k \le n$, is a 3-tuple $<\beta,\gamma,P>$, where

- $\beta$ is a mapping of the set $\{1,2,\ldots,k\}$ into $\{1,2,\ldots,n\}$,

- $\gamma$ is a permutation on the set $\{0,1,\ldots,N-1\}$ and

- P is a tuple of partitions $<P_{\beta(1)}, P_{\beta(2)}, \ldots, P_{\beta(k)}>$ determined
by $\beta$ and $\gamma$ as follows:

$P_{\beta(i)} = <P_{\beta(i),1}, P_{\beta(i),2}, \ldots, P_{\beta(i),2^{n-\beta(i)}}>$, where
$P_{\beta(i),j} = <u_{1,j}, u_{2,j}, \ldots, u_{2^{\beta(i)},j}>$ such that
$u_{m,j} = \gamma(2^{\beta(i)}(j-1)+m-1)$, $1 \le j \le 2^{n-\beta(i)}$ and $1 \le m \le 2^{\beta(i)}$.
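The way $\beta$ and $\gamma$ determine the partitions of a frame can be made concrete with a small Python sketch. The function name `frame_partitions` and the list-based encoding ($\beta$ as the list of column labels, $\gamma$ as the list of row labels) are assumptions made for this illustration only.

```python
def frame_partitions(beta, gamma, n):
    """Return the tuple of partitions <P_beta(1), ..., P_beta(k)> of a frame.

    beta  : list of column labels beta(1), ..., beta(k), each in 1..n
    gamma : list with gamma[r] the label of row r, a permutation of 0..N-1
    """
    N = 2 ** n
    partitions = []
    for b in beta:
        size = 2 ** b  # number of elements per block in this column
        blocks = [
            # u_{m,j} = gamma(2^beta(i) (j-1) + m - 1)
            [gamma[size * (j - 1) + m - 1] for m in range(1, size + 1)]
            for j in range(1, N // size + 1)
        ]
        partitions.append(blocks)
    return partitions


if __name__ == "__main__":
    # Identity beta and gamma give the standard partitions for N = 8.
    for part in frame_partitions([1, 2, 3], list(range(8)), 3):
        print(part)
    # gamma = (0)(1 3 6 4)(2 5)(7) written as a list of images.
    for part in frame_partitions([1, 2, 3], [0, 3, 5, 6, 1, 2, 4, 7], 3):
        print(part)
```

With the identity row labeling the output reproduces the standard partitions of Example II.1; a non-identity $\gamma$ only relabels the rows, leaving the block sizes unchanged.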


Definition II.8. (a-frame, standard a-frame, b-frame, standard
b-frame): Consider the 3-tuple $<\beta,\gamma,P>$ that defines a frame $F_{N\times k}$. If $\beta$ is the
identity permutation, then $F_{N\times k}$ is an a-frame denoted by $F^{a}_{N\times k}$. If $\beta$ and $\gamma$
are the identity permutations (which implies $P=P^*$), then $F_{N\times k}$ is the standard
a-frame denoted by $F^{a*}_{N\times k}$.

By definition of the standard a-frame, column $f_i^{a}$, $1 \le i \le n$, has $2^{n-i}$
blocks, each having $2^i$ rows. Unless otherwise stated, the number of the rows of
$F_{1:k}$ is assumed to be N. The notation $F_{x:y}$ is used to denote the
subframe that contains those columns of F whose indices are $x, x+1, \ldots, y$.
Unless specifically stated, the number of rows of $F_{x:y}$ is assumed to be N.


Example II.2. The following are examples of frames for N=8 and k=3.

(a) $F_{8\times3}=<\beta,\gamma,P>$ where $\beta = (1\ 2)(3)$, $\gamma$ is the identity permutation and
$P=<P_2,P_1,P_3>$ such that $P_1=P_1^*$, $P_2=P_2^*$ and $P_3=P_3^*$.

(b) $F_{8\times3}=<\beta,\gamma,P>$ where $\beta$ = identity permutation, $\gamma = (0)(1\ 2)(3)(4)(5)(6)(7)$,
$P=<P_1,P_2,P_3>$, $P_1$ = < <0,2>, <1,3>, <4,5>, <6,7> >,
$P_2$ = < <0,2,1,3>, <4,5,6,7> > and $P_3$ = <0,2,1,3,4,5,6,7>.

(c) $F_{8\times3}=<\beta,\gamma,P>$ where $\beta$ = identity permutation, $\gamma = (0)(1\ 3\ 6\ 4)(2\ 5)(7)$,
$P=<P_1,P_2,P_3>$, $P_1$ = < <0,3>, <5,6>, <1,2>, <4,7> >,
$P_2$ = < <0,3,5,6>, <1,2,4,7> > and $P_3$ = <0,3,5,6,1,2,4,7>.

(d) $F_{8\times3}=<\beta,\gamma,P>$ where $\beta = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 2 & 3 \end{pmatrix}$, $\gamma$ is the identity permutation,
$P=<P_2,P_2,P_3>$, $P_2=P_2^*$ and $P_3=P_3^*$.


Definition II.9. (Graphical representation of a frame, rectangle of
a frame): The graphical representation of a frame $F_{N\times k}=<\beta,\gamma,P>$ consists of
k columns labeled $f_i$, $i = 1,\ldots,k$, from left to right and N rows labeled $\gamma(j)$,
$j = 0,1,\ldots,N-1$, starting at the top. The column $f_i$ corresponds to the
partition $P_{\beta(i)}$, that is, $f_i$ consists of $2^{n-\beta(i)}$ blocks of $2^{\beta(i)}$ entries each. In the
graphical representation of a frame, any polygon with four sides and four right
angles is a rectangle of the frame.


Example II.3. Figures II.5a, II.5b, II.5c and II.5d show the graphical
representation of the frames described in parts (a), (b), (c) and (d) of
Example II.2, respectively. Figure II.5e shows the graphical representation of the
standard a-frame for N=8 and k=3. The labels of the partitions below each column are
implied by the sizes of the rectangles in the column and can be omitted.




Figure II.5. (a), (b), (c) and (d) are the graphical representations of the frames described in
parts (a), (b), (c) and (d) of Example II.2, respectively. (e) Graphical
representation of the standard a-frame for N=8 and k=3.


Definition II.10. (Fit): Let $k \ge 1$, $0 \le i \le N-1$ and $1 \le j \le k$.
Consider a balanced matrix $A_{N\times k}$ and a frame $F_{N\times k}$. The matrix A fits $F_{N\times k}$ if and
only if, after placing $a_{i,j}$ in the ith row and jth column of $F_{N\times k}$, every rectangle
of $F_{N\times k}$ contains a balanced matrix.

Example II.4. The matrix E, shown just after Definition II.4, fits all the
frames shown in Figure II.5 (a)-(d). It does not fit the standard a-frame shown in
Figure II.5e because, for example, the submatrix in the top leftmost rectangle (the
2-tuple $P^*_{1,1}$) is not balanced.


III. BASIC ALGORITHMS

This section introduces sequential and parallel algorithms for determining
a column vector x for given balanced matrices $A_{N\times(n-1)}$ and $B_{N\times(n-1)}$ such
that $[A\ x]$ and $[x\ B]$ are balanced. These algorithms are later used
in obtaining the routing algorithms for the Benes and $SE_{1:(n-1)}IP^S SE^{-1}_{1:n}$
networks. Some preliminary results and basic definitions are introduced first.

Lemma III.1. [2] For $n \ge 2$, let A and B be two $N\times(n-1)$ balanced
matrices. Then there exists a column vector x such that both $[A\ x]$ and $[x\ B]$
are balanced matrices.

Note that, when the order of columns in a balanced matrix with at most n
columns is changed, the matrix remains balanced. Therefore, the position of x
in the matrices $[A\ x]$ and $[x\ B]$ in Lemma III.1 is immaterial. Because the possible
choices of vector x increase as the number of columns of A or B is reduced,
Lemma III.1 remains valid when A and B have fewer than n-1 columns.




Some properties of balanced matrices can be captured by graphs. A graph
$G=(V,E)$ consists of a set of vertices V and a set of edges E, each of which is a
pair of vertices. The union of two graphs $G_1=(V,E_1)$ and $G_2=(V,E_2)$ is the
graph $G=G_1\cup G_2=(V,E_1\cup E_2)$. In other words, an edge is present in
$G=G_1\cup G_2$ if and only if it is present in either $G_1$ or $G_2$. A subset M of
edges in a graph G is called independent or a matching if no two edges of M
have a vertex in common. A matching M is said to be a perfect matching if it
covers all vertices of G. A more extended discussion of these basic concepts can be
found in [3,4].

Definition III.1. (Perfect matching graph of a matrix): Let A be an
$N\times k$ ($1 \le k \le n-1$, $n \ge 2$) balanced matrix. A perfect matching graph of A,
denoted by $PG_A$, is a graph whose vertices are in one-to-one correspondence
with the rows of A and have degree one, and in which vertices $v_i$ and $v_j$ are joined
by an edge only if the ith row and jth row of A are identical.

If the number of columns in a balanced matrix $A_{N\times k}$ is less than n-1 (i.e.,
if $k<n-1$), then its perfect matching graph is not unique because each distinct
row in A appears $2^{n-k}$ times. If $k=n-1$, then $PG_A$ is unique because each
distinct row in A appears twice. As an example, consider the balanced matrix $H_{8\times2}$
presented just after Definition II.4. Its perfect matching graph is unique and
shown in Figure III.1.
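A perfect matching graph can be obtained by grouping identical rows. The Python sketch below is an illustration under the assumption that rows are stored as lists of bits; the function name `perfect_matching_graph` and the example matrix `M` are hypothetical. When $k=n-1$ every distinct row occurs exactly twice, so the returned edge set is the unique perfect matching graph; for $k<n-1$ the function returns just one of the possible matchings.

```python
from collections import defaultdict


def perfect_matching_graph(matrix):
    """Return a sorted list of edges (i, j), i < j, joining rows with identical contents."""
    groups = defaultdict(list)
    for index, row in enumerate(matrix):
        groups[tuple(row)].append(index)
    edges = []
    for indices in groups.values():
        # Pair up identical rows two at a time; each vertex gets degree one.
        for a, b in zip(indices[0::2], indices[1::2]):
            edges.append((a, b))
    return sorted(edges)


if __name__ == "__main__":
    M = [[0, 1], [1, 1], [0, 0], [1, 1], [1, 0], [0, 0], [0, 1], [1, 0]]  # N = 8, k = 2
    print(perfect_matching_graph(M))
```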

Definition III.2. (Labeling): A 2-labeling or 2-coloring of a graph is the
assignment of integers 0 and 1 to its vertices such that the labels of the vertices
incident with an edge are different.

Fact III.1. [2] The union of two perfect matching graphs with the same
set of vertices is a union of disjoint even cycles and, therefore, it can be
2-labeled.


Figure III.1. The perfect matching graph of $H_{8\times2}$.

Given balanced matrices $D_{1:n}$ and $E_{1:n}$, the following algorithm
CONS-COLUMN determines a column vector V such that $S_{1:n} = [D_{2:n}\ V]$ and
$T_{1:n} = [E_{2:n}\ V]$ are balanced. The detailed steps of this serial algorithm are
shown in procedure COLUMN.




Algorithm CONS-COLUMN
Input: Balanced matrices $D_{1:n}$ and $E_{1:n}$.

Output: A vector V such that $[D_{2:n}\ V]$ and $[E_{2:n}\ V]$ are balanced.

Step 1. Determine $D^{-1}_{1:n}$ and $E^{-1}_{1:n}$, which represent the inverses of the
permutations corresponding to $D_{1:n}$ and $E_{1:n}$, respectively.

Step 2. Determine the perfect matching graph of $D_{2:n}$ by joining the
vertices with indices $D^{-1}_{1:n}(j)$ and $D^{-1}_{1:n}(j+(N/2))$ by an edge,
$0 \le j \le (N/2)-1$. Similarly, determine the perfect matching graph
of $E_{2:n}$ by joining the vertices with indices $E^{-1}_{1:n}(j)$ and
$E^{-1}_{1:n}(j+(N/2))$ by an edge.

Step 3. 2-label each cycle of the union of the perfect matching graphs of
$D_{2:n}$ and $E_{2:n}$. The vector V is such that V(r) equals the integer
assigned to the vertex with index r in the 2-labeling, where
$0 \le r \le N-1$. Stop.


To prove the correctness of Algorithm CONS-COLUMN, note that $D^{-1}_{1:n}(i)$
can be thought of as a pointer to the row with value i of $D_{1:n}$ because it equals
the index of the row of $D_{1:n}$ whose contents equal i. This implies that any two
rows of $D_{1:n}$ whose indices equal $D^{-1}_{1:n}(j)$ and $D^{-1}_{1:n}(j+(N/2))$ differ only in their
leftmost bits. When the perfect matching graph of $D_{2:n}$ needs to be
constructed, it follows from Definition III.1 that the vertices with indices $D^{-1}_{1:n}(j)$
and $D^{-1}_{1:n}(j+(N/2))$ must be joined by an edge. Similarly, in the case of the perfect
matching graph of $E_{2:n}$, the vertices with indices $E^{-1}_{1:n}(j)$ and $E^{-1}_{1:n}(j+(N/2))$
must be joined by an edge. By Fact III.1, the union of two perfect matching
graphs can be labeled in such a way that the labels of the vertices incident with an
edge constitute the set {0,1}. This completes the correctness proof of
Algorithm CONS-COLUMN.
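The three steps of Algorithm CONS-COLUMN can be sketched in Python as follows. This is a minimal illustration, not the authors' procedure: matrices are assumed to be lists of bit rows, the names `cons_column`, `to_bits`, `r1` and `r2` are hypothetical, and a plain scan over the vertices replaces any auxiliary list for locating the next unlabeled cycle.

```python
def to_bits(value, n):
    """Binary representation (b1, ..., bn) with b1 the most significant bit."""
    return [(value >> (n - 1 - j)) & 1 for j in range(n)]


def cons_column(D, E):
    """Return V such that [D_{2:n} V] and [E_{2:n} V] are balanced.

    D and E are N x n permutation matrices given as lists of bit rows.
    """
    N = len(D)
    half = N // 2

    def inverse(M):
        inv = [0] * N
        for i, row in enumerate(M):
            inv[int("".join(map(str, row)), 2)] = i  # Step 1: inv[value] = row index
        return inv

    d_inv, e_inv = inverse(D), inverse(E)
    r1, r2 = [0] * N, [0] * N
    for j in range(half):  # Step 2: rows that differ only in the leftmost bit
        a, b = d_inv[j], d_inv[j + half]
        r1[a], r1[b] = b, a
        a, b = e_inv[j], e_inv[j + half]
        r2[a], r2[b] = b, a

    V = [None] * N
    for start in range(N):  # Step 3: 2-label every alternating cycle
        p = start
        while V[p] is None:
            V[p] = 0
            V[r1[p]] = 1
            p = r2[r1[p]]
    return V


if __name__ == "__main__":
    D = [to_bits(i, 3) for i in range(8)]
    E = [to_bits(v, 3) for v in [7, 4, 0, 2, 6, 1, 5, 3]]
    print(cons_column(D, E))
```

Each vertex is labeled exactly once, so the sketch runs in time linear in N, in line with the serial complexity discussed for the implementation that follows.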

Procedure COLUMN, an implementation of Algorithm CONS-COLUMN,
is now discussed in order to rigorously analyze the complexity of the algorithm.
It calls two simple procedures, namely CONS_LIST(N) and DELETE(k), which
construct a linear linked list of N nodes and delete the kth node of the list,
respectively (these procedures are described in the Appendix). Given balanced
matrices $D_{1:n}$ and $E_{1:n}$, procedure COLUMN determines a column vector V
such that the matrices $[D_{2:n}\ V]$ and $[E_{2:n}\ V]$ are balanced. Lines 1 to 4 in
procedure COLUMN construct $D^{-1}_{1:n}$ and $E^{-1}_{1:n}$, which represent the inverses of the
permutations represented by $D_{1:n}$ and $E_{1:n}$, respectively. Lines 5 to 10
create as many circular doubly linked lists as the number of cycles in the union
of the perfect matching graphs of $D_{2:n}$ and $E_{2:n}$. (Recall that any of these
cycles has an even number of vertices.) Any node of these lists has three fields
called $R_1$, $R_2$ and V, where $R_1$ and $R_2$ are used for pointers and V will, later,
be assigned either 0 or 1 in the 2-labeling. The fields $R_1$ (respectively, $R_2$) are used
to obtain the perfect matching graph of $D_{2:n}$ (respectively, $E_{2:n}$). The pointers
$R_1$ and $R_2$ of any one of these lists are ordered as
$(R_1R_2R_2R_1 \cdots R_1R_2R_2R_1)$. Line 11 calls the procedure CONS_LIST(N)
to construct a linear linked list, called H, of N nodes. Lines 12 to 24 do the
2-labeling of all the cycles in the union of the perfect matching graphs of $D_{2:n}$
and $E_{2:n}$ by assigning 0 or 1 appropriately to the fields V of nodes of the
corresponding circular doubly linked lists. The 2-labeling of the cycle
containing the vertex of index 0 is done first. Whenever the field V of a node is
assigned either 0 or 1, the node is deleted from the list H, so that H contains
only those nodes which are not assigned any integer yet. If the node to be
deleted is the header node, the pointer to the header is changed to point to the
node following the header node. When all the nodes of H are deleted, it means
that all the cycles are 2-labeled. When the 2-labeling of a cycle is completed
(i.e., when the “while” loop between lines 18 and 23 is exited), the next
cycle to be 2-labeled is the one that contains the node with the index p pointing
to the header of H. So, lines 12 to 24 do the 2-labeling of N nodes one by
one. Also, note that the “for” loops are executed O(N) times. Therefore, the
time complexity of procedure COLUMN is O(N).

line  procedure COLUMN

1     for i := 0 to N-1 do
2         $D^{-1}_{1:n}(D_{1:n}(i))$ := i
3         $E^{-1}_{1:n}(E_{1:n}(i))$ := i
4     end
5     for j := 0 to (N/2)-1 do
6         $R_1(D^{-1}_{1:n}(j))$ := $D^{-1}_{1:n}(j+(N/2))$
7         $R_1(D^{-1}_{1:n}(j+(N/2)))$ := $D^{-1}_{1:n}(j)$
8         $R_2(E^{-1}_{1:n}(j))$ := $E^{-1}_{1:n}(j+(N/2))$
9         $R_2(E^{-1}_{1:n}(j+(N/2)))$ := $E^{-1}_{1:n}(j)$
10    end
11    call CONS_LIST(N)
12    p := 0
13    while p $\ne$ nil do
14        V(p) := 0; V($R_1$(p)) := 1
15        m := $R_2$($R_1$(p))
16        call DELETE(p)
17        call DELETE($R_1$(p))
18        while V(m) is empty do
19            V(m) := 0; V($R_1$(m)) := 1
20            m := $R_2$($R_1$(m))
21            call DELETE(m)
22            call DELETE($R_1$(m))
23        end
24    end
25    end COLUMN

Algorithm CONS-COLUMN and procedure COLUMN can be adapted for
execution on an N processor PRAM machine [21], as shown next in Algorithm
CONS-COLUMN_PRAM and procedure COLUMN_PRAM.

Algorithm CONS-COLUMN_PRAM
Input: Balanced matrices $D_{1:n}$ and $E_{1:n}$.

Output: A column vector V such that $[D_{2:n}\ V]$ and $[E_{2:n}\ V]$ are balanced.

Step 1. Determine the perfect matching graphs of $D_{2:n}$ and $E_{2:n}$.

Step 2. Represent each cycle of the union of the perfect matching graphs of
$D_{2:n}$ and $E_{2:n}$ by a circular doubly linked list.

Step 3. For each list, first determine the node with smallest index and then
assign “nil” to one of its pointers.

Step 4. For each list, determine the distance of each node to the node with
smallest index.

/* Note that, although the distance of any node of a circular
doubly linked list to the node of smallest index can be measured in two
different directions, both distances of the same node are either odd
or even because the list contains an even number of nodes. */

Step 5. For each list, assign 0 (respectively, 1) to all those nodes whose
distances are even (respectively, odd). Stop.

The correctness proof of Algorithm CONS_COLUMN_PRAM is the same
as that of Algorithm CONS-COLUMN, except that Algorithm
CONS_COLUMN_PRAM is implemented using N processors instead of a single
processor.

Given balanced matrices $D_{1:n}$ and $E_{1:n}$, procedure COLUMN_PRAM
determines a column vector V using an N-processor PRAM model such that the
matrices $[D_{2:n}\ V]$ and $[E_{2:n}\ V]$ are balanced. Assignments which are local and
not local to the processors are denoted by the operators := and ←, respectively.
Any variable with index i is an operand in processor PE(i), where
$0 \le i \le N-1$. Lines 1 and 2 determine the inverses of the permutations
represented by $D_{1:n}$ and $E_{1:n}$. Lines 3 to 10 represent each cycle of the union
of the perfect matching graphs of $D_{2:n}$ and $E_{2:n}$ by a circular doubly linked
list. The pointers $R_1$ and $R_2$ of any one of these lists are ordered as
$(R_1 R_2 R_2 R_1 \cdots R_1 R_2 R_2 R_1)$. Let $2 \le p \le n$. Any one of these lists can have
at most $2^k$ nodes, where $k = p-1$ if $D_{p:n} = E_{p:n}$, else $k = n$.

Lines 11 and 12 copy the circular lists into new lists whose pointers are
denoted by $S_1$ and $S_2$. Lines 13 to 21 determine the node of the smallest index
in each list, and the smallest index is stored in M(i). Line 13 initializes M(i) to
i. Lines 14 and 15 find the smaller index of every two nodes that point to each
other by their $S_1$ fields, and then the M(i)s of these nodes are assigned the smaller
index. Line 16 partitions each circular linked list into two sublists, such that
each sublist is formed by the $S_1$ fields of the nodes and contains every other
node of the list. Due to lines 14 and 15, the smallest of the node indices
of a list is stored in the M(i) of a node in every sublist. This guarantees
that the M(i)s of all the nodes of a list will eventually contain the same index
even though the list is partitioned into two sublists. Because a sublist can have
at most $2^{k-1}$ nodes and, after the jth iteration, $S_1(i)$ of node i points to the
node which is $2^j$ nodes away from node i, the "for" loop in lines 17 to 21 needs
to be executed at most $k-1$ times. Hence, the "for" loop takes O(k) time.
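
The doubling step behind lines 17 to 21 can be illustrated with a sequential simulation. This is only a sketch under stated assumptions: the lists are given by a hypothetical successor array `succ` (one pointer per node, each list circular), every synchronous PRAM round is emulated by building new arrays from the old ones, and the function name `cycle_minimum` is not from the report.

```python
def cycle_minimum(succ):
    """For each node of a set of circular lists, return the smallest index in its list.

    succ[i] is the successor of node i; succ must be a permutation.
    Emulates synchronous pointer jumping: after round j, S[i] is 2**j
    nodes away from i and M[i] is the minimum over the 2**j nodes
    starting at i.
    """
    n_nodes = len(succ)
    M = list(range(n_nodes))      # M(i) := i
    S = list(succ)
    # ceil(log2(n_nodes)) rounds cover any cycle of at most n_nodes nodes.
    rounds = max(1, (n_nodes - 1).bit_length())
    for _ in range(rounds):
        # All "processors" read the old arrays, then write the new ones.
        M = [min(M[i], M[S[i]]) for i in range(n_nodes)]
        S = [S[S[i]] for i in range(n_nodes)]
    return M
```

For example, `cycle_minimum([3, 4, 0, 2, 1, 5])` describes the cycles (0 3 2), (1 4) and (5); every node ends up holding the smallest index of its own cycle.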

Lines 22 to 35 determine d(i), which is the distance of node i to the node of
the smallest index. Line 23 partitions each linked list into two sublists, each
formed by the $R_1$ fields of the nodes and containing every other node of the
original list. The distance d(i) is first initialized to 2 (line 24), then the d(i)s of the
node with the smallest index and its neighboring nodes are made equal to 0, 1,
and 1, respectively, and their $R_1(i)$s are assigned "nil" (lines 25 to 29). The
reason why the d(i) of all the nodes other than the node with the smallest index
and its neighboring nodes is equal to 2 initially is that every sublist of a list
contains every other node of the list. For the same reason, one of the two sublists
contains the node with the smallest index while the other sublist contains the
neighboring nodes of the node with the smallest index. So, a sublist has at least
one node with its $R_1$ field assigned "nil". The distances of the other nodes of
every list to the node of the smallest index are computed in the "while" loop
in lines 30 to 35. This loop can be executed at most k times because,
after the jth iteration, $R_1(i)$ of node i of a sublist points to the node which is $2^j$
nodes away from node i, and $R_1(i)$ is assigned "nil" when the $R_1$ field of the
node pointed to by $R_1(i)$ equals "nil". Lines 36 and 37 do the 2-labeling of
the nodes of every list. Because the "for" and "while" loops of procedure
COLUMN_PRAM take O(log N) time, its time complexity equals O(log N).
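
The distance computation of lines 30 to 35 follows the usual pointer-jumping pattern for ranking a linked list. The sketch below is a sequential emulation under assumptions that are not in the report: a nil-terminated list given by a hypothetical array `nxt` (with `None` for "nil"), unit initial distances instead of the initial values 0, 1 and 2 used by the procedure, and the names `list_rank` and `labels_from_rank`.

```python
def list_rank(nxt):
    """Distance of every node to the end of a nil-terminated linked list.

    nxt[i] is the node after i, or None for the last node.
    Each pass of the while loop emulates one synchronous PRAM round.
    """
    n_nodes = len(nxt)
    d = [0 if nxt[i] is None else 1 for i in range(n_nodes)]
    R = list(nxt)
    while any(r is not None for r in R):
        new_d = list(d)
        new_R = list(R)
        for i in range(n_nodes):
            if R[i] is not None:
                new_d[i] = d[i] + d[R[i]]   # d(i) <- d(i) + d(R(i))
                new_R[i] = R[R[i]]          # R(i) <- R(R(i))
        d, R = new_d, new_R
    return d


def labels_from_rank(d):
    """Assign 0 to even distances and 1 to odd ones (the 2-labeling step)."""
    return [x % 2 for x in d]
```

Because every round doubles the distance spanned by each pointer, the number of rounds grows with the logarithm of the list length, which is the source of the O(log N) bound.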


line  procedure COLUMN_PRAM

 1    D^{-1}_{1:n}(D_{1:n}(i)) ← i
 2    E^{-1}_{1:n}(E_{1:n}(i)) ← i
 3    if i ≥ 0 and i ≤ (N/2)-1 then do
 4       R_1(D^{-1}_{1:n}(i)) ← D^{-1}_{1:n}(i+(N/2))
 5       R_2(E^{-1}_{1:n}(i)) ← E^{-1}_{1:n}(i+(N/2))
 6    end
 7    if i ≥ N/2 and i ≤ N-1 then do
 8       R_1(D^{-1}_{1:n}(i)) ← D^{-1}_{1:n}(i-(N/2))
 9       R_2(E^{-1}_{1:n}(i)) ← E^{-1}_{1:n}(i-(N/2))
10    end
11    S_1(i) := R_1(i)
12    S_2(i) := R_2(i)
13    M(i) := i
14    T(i) ← M(S_1(i))
15    M(i) := min{T(i), M(i)}
16    S_1(i) ← S_2(S_1(i))
17    for j := 1 to k-1 do
18       T(i) ← M(S_1(i))
19       S_1(i) ← S_1(S_1(i))
20       M(i) := min{T(i), M(i)}
21    end
22    t(i) := R_1(i)
23    R_1(i) ← R_2(R_1(i))
24    d(i) := 2
25    if M(i) = i then do
26       d(i) := 0;  R_1(i) := nil
27       d(t(i)) ← 1;  R_1(t(i)) ← nil
28       d(R_2(i)) ← 1;  R_1(R_2(i)) ← nil
29    end
30    while there exists a node i such that R_1(i) ≠ nil do
31       if R_1(i) ≠ nil then do
32          d(i) ← d(i) + d(R_1(i))
33          R_1(i) ← R_1(R_1(i))
34       end
35    end
36    if d(i) = 2 × ⌊d(i)/2⌋ then V(i) := 0
37    else V(i) := 1
38    end COLUMN_PRAM


An example is presented below to show how circular doubly linked lists
are used to represent the cycles of the union of the perfect matching graphs of
two given balanced matrices. Also, the 2-labeling of these lists and the resulting
column vector are shown.

Example III.1. Let $\Pi$ = (0 4 1 10 9 6 15 3 5 7)(2 11)(8 12 13)(14) and
$\Phi$ = (0 2)(1 3 9 4)(5 8 10 11)(6 12 7 13 15 14) be the given permutations.
Assume that $D_{1:4}$ and $E_{1:4}$ represent $\Pi$ and $\Phi$ in binary, respectively. The
inverses of these permutations, denoted by $\Pi^{-1}$ and $\Phi^{-1}$, can be easily
determined: $\sigma = \Pi^{-1}$ = (0 7 5 3 15 6 9 10 1 4)(2 11)(8 13 12)(14) and
$\mu = \Phi^{-1}$ = (0 2)(1 4 9 3)(5 11 10 8)(6 14 15 13 7 12). The perfect matching
graphs of $D_{2:4}$ and $E_{2:4}$, denoted by $PG_D$ and $PG_E$, respectively, are illustrated
in Figure III.2. The union of the perfect matching graphs of $D_{2:4}$ and $E_{2:4}$,
denoted by $PG_{D \cup E}$, is illustrated in Figure III.3.

Each cycle of $PG_{D \cup E}$ is represented by a circular doubly linked list such
that the pointer field $R_1$ (respectively, $R_2$) of each node is used for $PG_D$
(respectively, $PG_E$). The corresponding lists are shown in Figure III.4a. Each
list (as well as a cycle) can be 2-labeled in two different ways. The integers
assigned to the nodes of a list in a 2-labeling are stored in their V fields. The
column vector V obtained from these 2-labelings is shown with the given
matrices $D_{1:4}$ and $E_{1:4}$ in Figure III.4b.

[Figure III.2: graph drawings not reproduced; the vertex labels are listed below.]

(a) $\sigma(0)=7$, $\sigma(1)=4$, $\sigma(2)=11$, $\sigma(3)=15$, $\sigma(4)=0$, $\sigma(5)=3$, $\sigma(6)=9$, $\sigma(7)=5$,
    $\sigma(8)=13$, $\sigma(9)=10$, $\sigma(10)=1$, $\sigma(11)=2$, $\sigma(12)=8$, $\sigma(13)=12$, $\sigma(14)=14$, $\sigma(15)=6$

(b) $\mu(0)=2$, $\mu(1)=4$, $\mu(2)=0$, $\mu(3)=1$, $\mu(4)=9$, $\mu(5)=11$, $\mu(6)=14$, $\mu(7)=12$,
    $\mu(8)=5$, $\mu(9)=3$, $\mu(10)=8$, $\mu(11)=10$, $\mu(12)=6$, $\mu(13)=7$, $\mu(14)=15$, $\mu(15)=13$

Figure III.2. (a) $PG_D$, the perfect matching graph of $D_{2:4}$. (b) $PG_E$, the perfect matching
graph of $E_{2:4}$. The numbers on an edge refer to the indices of its vertices.


Although  this  algorithm  works  like  Lee’s  algorithm  in  the  sense  that  it  sets  the 
SBs  of  a  stage  at  a  time,  starting  with  the  leftmost  stage  and  heading  towards 
the  rightmost  stage,  it  is  easy  to  understand  and  simple  to  program. 

When the Benes network realizes a permutation represented by a balanced
matrix $B_{1:n}$, the mth input of the Benes network is sent to the output whose
value equals $B_{1:n}(m)$, $0 \le m \le N-1$. In order for a balanced matrix $B_{1:n}$ to
also pass the Benes network, $B_{1:n}$ must be transformed by $BE_{1:(n-1)}$ into
another balanced matrix which can pass $RB_{1:n}$. Algorithm CNTR-BENES first
determines a balanced matrix $C_{1:(n-1)}$ which passes $BE_{1:(n-1)}$ and transforms
$B_{1:n}$ into another balanced matrix passing $RB_{1:n}$. Algorithm CNTR-BENES
then determines the routing tag matrix $U_{1:(2n-1)}$ whose mth row is the routing
tag for the mth input of the Benes network to implement a given permutation
matrix $B_{1:n}$ through it.

The routing scheme used in the Benes network is as follows: at the jth stage,
$1 \le j \le 2n-1$, of the Benes network, every SB examines the bit $u_j$ of the routing
tag $(u_1 u_2 \cdots u_{2n-1})$ of each of its inputs. If $u_j = 0$, the input is sent to the upper
output of the SB; otherwise, the lower output of the SB is taken.
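
A single switching box (SB) under this scheme can be sketched as below. The function name `set_switch` and the encoding of tags as sequences of bits are assumptions made for illustration; the sketch only decodes bit $u_j$ of the two tags and reports a conflict when both inputs ask for the same output.

```python
def set_switch(upper_tag, lower_tag, j):
    """Return 'straight' or 'cross' for a 2x2 switch at stage j (1-based).

    upper_tag and lower_tag are sequences of 0/1 routing-tag bits carried
    by the inputs on the upper and lower ports. Bit u_j = 0 requests the
    upper output, u_j = 1 the lower output.
    """
    u_upper = upper_tag[j - 1]
    u_lower = lower_tag[j - 1]
    if u_upper == u_lower:
        raise ValueError("conflict: both inputs request the same output")
    # The upper input staying on the upper output means no crossing.
    return "straight" if u_upper == 0 else "cross"
```

A tag matrix "passes" a stage exactly when no switch of that stage raises the conflict above.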

Algorithm CNTR-BENES

Input:   A balanced matrix $B_{1:n} = [b_1\ b_2 \cdots b_n]$ and the reverse permutation
matrix $R_{1:n}$.

Output:  A matrix $U_{1:(2n-1)}$ such that, when its mth row, $0 \le m \le N-1$, is
used as the routing tag for the mth input of the Benes network, the
permutation represented by $B_{1:n}$ is realized by the Benes network.

Step 1.  Let p denote an integer variable. Let $B'_{1:n} = [b_n\ b_{n-1} \cdots b_1]$. Set
p = 1. Determine a column vector $c_1$ such that the matrices
$[R_{2:n}\ c_1]$ and $[B_{1:(n-1)}\ c_1]$ are balanced.

Step 2.  Increment p by 1. If $p > n-1$, go to the next step; otherwise, first
determine a column vector $c_p$ such that $[R_{(p+1):n}\ C_{1:p}]$ and
$[B_{1:(n-p)}\ C_{1:p}]$ are balanced, and then go to Step 2.

Step 3.  Let $U_{1:(2n-1)} = [C_{1:(n-1)}\ B_{1:n}]$. A switch at the jth stage,
$1 \le j \le 2n-1$, of the Benes network examines the bit $u_j$ of the
routing tag $(u_1 u_2 \cdots u_{2n-1})$ of its input to set itself either straight
or cross. Stop.

Theorems IV.1 and IV.2 below are used in the correctness proof of Algorithm
CNTR-BENES. The proof of Theorem IV.2 appears in the Appendix.
Theorem IV.1 establishes the correspondence between $RB_{1:k}$ and $F^{rb}_{1:k}$.

Theorem IV.1. A matrix $D_{1:k} = [d_1\ d_2 \cdots d_k]$ fits $F^{rb}_{1:k}$ if and only if
$D_{1:k}$ passes $RB_{1:k}$, $1 \le k \le n$. Moreover, $RB_{1:k}$ sends its ith input to its jth
output, where j is equal to the sum of $(\lfloor i/2^k \rfloor \times 2^k)$ and the value of $D_{1:k}(i)$.

Basic idea of proof:

(→) $D_{1:k}$ fits $F^{rb}_{1:k}$ → $D_{1:k}$ passes $RB_{1:k}$.

Induction on k is used. For k = 1, each rectangle of $F^{rb}_{1:k}$ has a 0 and a 1. These
are the control bits of a switch in $RB_{1:k}$ and, thus, no conflict occurs. For k > 1,
assuming the theorem holds for k-1, each switch in the kth stage "has" control
bits 0 and 1 and, therefore, no conflicts occur. These control bits must appear
as the kth bits at the end of identical (k-1)-bit rows of the two subframes
$F^{rb}_{2^{k-1} \times (k-1)}$ of $F^{rb}_{1:k}$ so that $D_{1:k}$ fits $F^{rb}_{1:k}$. Each subframe corresponds to a
subnetwork of $RB_{1:k}$ which is also a reverse baseline network $RB_{2^{k-1} \times (k-1)}$.

(←) $D_{1:k}$ passes $RB_{1:k}$ → $D_{1:k}$ fits $F^{rb}_{1:k}$.

Induction on k is used. For k = 1, if $d_1$ passes $RB_1$, then each rectangle of $F^{rb}_{1:k}$
contains a 0 and a 1 and $d_1$ fits $F^{rb}_{1:k}$. For k > 1, assuming the theorem holds for
k-1, for the outputs of the two subnetworks $RB_{2^{k-1} \times (k-1)}$ to cause
no conflict in any switch of the kth stage it must be the case that a 0 and a 1
are added to the k-1 entries of identical rows of the frames that correspond to
the two subnetworks. This implies that $D_{1:k}$ fits $F^{rb}_{1:k}$. The value of j follows
from the topology of $RB_{1:k}$ and how switches are set by control bits. □


Theorem IV.2. Consider a balanced matrix $D_{N \times k}$, $1 \le k \le n$, the reverse
permutation matrix $R_{N \times n}$ and the frame $F^{rb}_{N \times k}$. The matrix $[R_{N \times n}\ D_{N \times k}]$ is
balanced if and only if $D_{N \times k}$ fits $F^{rb}_{N \times k}$.


To prove the correctness of Algorithm CNTR-BENES, note that it follows
from Algorithms CONS-COLUMN and CONS_COLUMN_PRAM
that Step 1 and Step 2 of Algorithm CNTR-BENES are realizable. Now, it
remains to show that, when the mth row of $U_{1:(2n-1)}$ is used as the routing tag
for the mth input of the Benes network, the permutation represented by $B_{1:n}$ is
realized on the Benes network.

Because $[R_{1:n}\ C_{1:(n-1)}]$ is balanced, it follows from Theorems IV.1 and
IV.2 that $C_{1:(n-1)}$ passes $RB_{1:(n-1)}$. Because $BE_{1:n}$ is functionally and
topologically equivalent to $RB_{1:n}$, $RB_{1:(n-1)}$ can be converted to $BE_{1:(n-1)}$ by
repositioning its switches only. This implies that $C_{1:(n-1)}$ also passes $BE_{1:(n-1)}$, that
is, no conflict occurs in the switches of $BE_{1:(n-1)}$ when the ith row of $C_{1:(n-1)}$
is used as the routing tag for the ith input. Let z denote a column vector such
that $[C_{1:(n-1)}\ z]$ is balanced. Because $BE_{1:n}$ (and $RB_{1:n}$) realizes the permutation
represented by $[C_{1:(n-1)}\ z]$ (Theorem IV.1), the routing tags form the
matrix $I_{1:n}$ when they reach the outputs of $BE_{1:n}$. This implies that, when the
mth row of $C_{1:(n-1)}$ is used as the routing tag for the mth input of $BE_{1:(n-1)}$,
the routing tags form the matrix $I_{1:(n-1)}$ at the outputs of $BE_{1:(n-1)}$. So, when
the mth row of $U_{1:(2n-1)}$ is used as the routing tag for the mth input of
$BE_{1:(n-1)}$, the routing tags form the matrix $U^*_{1:(2n-1)} = [I_{1:(n-1)}\ B^*_{1:n}]$ at the
outputs, where $B^*_{1:n}$ denotes the matrix obtained from $B_{1:n}$ by routing it with
$C_{1:(n-1)}$. It will be shown next that $B^*_{1:n}$ is realized by $RB_{1:n}$.

Let $1 \le r \le n-1$. The rth stage of $BE_{1:(n-1)}$, called $BE_r$, is a pile of $2^{r-1}$
inverse SE stages on $2^{n-r+1}$ inputs/outputs each. The matrix $U_{1:(2n-1)}$ of routing
tags is assigned $[C_{1:(n-1)}\ B_{1:n}]$ in Step 3 of Algorithm CNTR-BENES.
Note that the matrix $[B'_{1:n}\ C_{1:(n-1)}]$ is balanced due to the way $C_{1:(n-1)}$ is
constructed in Steps 1 and 2 of Algorithm CNTR-BENES. Because the matrix
$[B'_{1:n}\ C_{1:(n-1)}]$ is balanced and $C_{1:(n-1)}$ passes $BE_{1:(n-1)}$, the following is true:
the first stage of $BE_{1:(n-1)}$ partitions $B_{1:n}$ into two submatrices of $2^{n-1}$ rows
each such that the first n-1 columns of each submatrix form a balanced
matrix; similarly, each of the 2 inverse SE stages of $BE_2$ partitions the submatrix
at its inputs into two submatrices of $2^{n-2}$ rows each such that the first
n-2 columns of each submatrix form a balanced matrix; in general, each inverse SE
stage of $BE_r$ partitions the submatrix arriving at its inputs into two submatrices
of $2^{n-r}$ rows each such that the first n-r columns of each submatrix form a
balanced matrix. This implies that, when $[C_{1:(n-1)}\ B_{1:n}]$ is routed through
$BE_{1:(n-1)}$, $[C_{1:(n-1)}\ B_{1:n}]$ is transformed into $[I_{1:(n-1)}\ B^*_{1:n}]$ such that $B^*_{1:n}$ fits
$F^{rb}_{1:n}$. It follows from Theorem IV.1 that $B^*_{1:n}$ can be realized by $RB_{1:n}$. Therefore,
the Benes network realizes the permutation represented by $B_{1:n}$ when the
ith input uses the ith row of $U_{1:(2n-1)}$ as the routing tag. This completes the
correctness proof of Algorithm CNTR-BENES.

Steps 1 and 2 of Algorithm CNTR-BENES can be implemented using either
procedure COLUMN or procedure COLUMN_PRAM, which are serial and
parallel, respectively. Recall that procedures COLUMN and COLUMN_PRAM
take O(N) time and O(log N) time, respectively. Because the n-1 columns of
$C_{1:(n-1)}$ are determined sequentially, Algorithm CNTR-BENES takes
$O(N \log N)$ time in serial and $O(\log^2 N)$ time in parallel. Because it takes
O(log N) time to set up the switches of the Benes network once the routing tags
are available, realizing a permutation on the Benes network using Algorithm
CNTR-BENES takes $O(N \log N)$ time in serial and $O(\log^2 N)$ time in parallel.
As shown next, these complexities are better than those of the algorithm
proposed by Lee [13], which also sets up the switches of the Benes network stage by
stage from left to right.

To describe Lee's algorithm [13], several definitions are needed. A Complete
Residue System modulo m, CRS(mod m), is a set of m integers which contains
exactly one representative of each residue class mod m. A Complete Residue
Partition, CRP, is a partition of a CRS(mod $2^k$) into two CRS's(mod $2^{k-1}$),
$k \ge 1$. Lee's algorithm sets the switches of the rth stage, $1 \le r \le n-1$, of the
Benes network as follows: in order to perform $2^{r-1}$ CRP's on the $2^{r-1}$
CRS's(mod $2^{n-r+1}$), switches are set sequentially in such a way that every two
switches that are set consecutively must have a common residue class number.
According to Lee, the control algorithm for the Benes network takes $O(N \log N)$
time because each stage is controlled in the O(N) time required for the CRP
control. However, not included in the analysis are the $O(N/2^{r-1})$ comparisons
needed to determine the setting of the next switch which has a common residue class number
with the switch just set up. If the comparisons are accounted for, Lee's algorithm
in fact takes $O(N^2 + 2(N/2)^2 + \cdots + (N/4)4^2) = O(N^2)$ time for serial
control. For the same reason, the algorithm takes $O(N^2)$ time when the number
of comparisons is also included in the complexity (instead of O(N)). Lee also
proposed in [9] a routing algorithm for $SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$ (i.e., the reduced $\Omega_N \Omega_N^{-1}$)
which also sets up the switches using the notion of CRS(mod m) and CRP.
Therefore, this algorithm also takes $O(N^2 \log N)$ time in serial and $O(N^2)$ time
in parallel. The CRP done at the first stage, which takes $O(N^2)$ time,
determines the overall complexity of the parallel algorithm.
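
The serial bound can be checked by summing the per-stage costs. As a worked step (assuming, as the expression above suggests, that stage r performs $2^{r-1}$ CRPs, each on $N/2^{r-1}$ elements and each costing the square of that size):

```latex
\sum_{r=1}^{n-1} 2^{\,r-1}\left(\frac{N}{2^{\,r-1}}\right)^{2}
  = N^{2}\sum_{r=1}^{n-1}\frac{1}{2^{\,r-1}}
  = N^{2}\left(2-\frac{4}{N}\right)
  = 2N^{2}-4N
  = O(N^{2}), \qquad N = 2^{n}.
```

For N = 8 the two stages contribute 64 + 32 = 96, which agrees with $2 \cdot 64 - 4 \cdot 8$.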

The following example uses figures to illustrate how Algorithm
CNTR-BENES determines a routing tag matrix $U_{1:(2n-1)}$ for a given permutation
on N = 8 numbers.

Example IV.2. Let N = 8 and $R_{8 \times 3} = [r_1\ r_2\ r_3]$. In this example, given a
permutation b = (0 3 2 1 6)(4 7)(5), the matrix $U_{1:5} = [C_{1:2}\ B_{1:3}]$ of routing tags
to realize the permutation is determined. The binary representation
$B_{1:3} = [b_1\ b_2\ b_3]$ of b is shown in Figure IV.2. As explained in Algorithm
CNTR-BENES, the matrix $C_{1:2}$ needs to be determined. Columns of $C_{1:2}$ can
be determined one by one from left to right using either procedure COLUMN
or procedure COLUMN_PRAM. Column $c_1$ must be a vector such that the matrices
$[R_{2:3}\ c_1]$ and $[B_{1:2}\ c_1]$ are balanced. So, $c_1$ is determined by a 2-labeling of the
union of the perfect matching graphs of $R_{2:3}$ and $B_{1:2}$, as shown in Figure IV.2.
Similarly, $c_2$ is determined by a 2-labeling of the union of the perfect matching
graphs of $[r_3\ c_1]$ and $[b_1\ c_1]$, so that the matrices $[r_3\ c_1\ c_2]$ and $[b_1\ c_1\ c_2]$ are
balanced, as shown in Figure IV.2. Using the routing scheme described in Step 3
of Algorithm CNTR-BENES, the switches of the Benes network are set to
realize the permutation b, as shown in Figure IV.3. End of Example.


-  22  - 


Figure  IV. 2.  (a)  Bt  3  =  (6,  4«  4-,],  the  binary  representation  of  a  given  permutation 

6=(0  3  2  1  8)(4  7)(5);  (b)  B'u,  =  [4,  42  4,];  (c)  Ru,  =  [r,  r...  r:i);  (d)  PGr.>mBu.2, 
the  union  of  the  perfect  matching  graphs  of  fi2:3  and  B , ;  the  solid  edges 
(respectively,  the  dashed  edges)  form  the  perfect  matching  graph  of  J?2.3 
(respectively,  (e)  ct  obtained  from  a  2-labeling  of  PGa^juB,**  (0 

[4.'  4,  e , ];  (g)  (r2  r;,  e ,];  (h)  /,C|,3 .,1m*,  «,|,  the  union  of  the  perfect  matching 
graphs  of  [r3  e,j  and  [4,  c , ];  the  solid  edges  (respectively,  the  dashed  edges) 
form  the  perfect  matching  graph  of  [r3  c  ( ]  (respectively,  [4,  c ,  ]);  (i)  e2 
obtained  from  a  2-labeling  of  Jc><7.r3  c  ,|u|* i  * ,|- 


[Figure IV.3: network diagram not reproduced.]

Figure IV.3. The Benes network with 8 inputs/outputs; the switches are set according
to the routing tag matrix $U_{1:5} = [c_1\ c_2\ b_1\ b_2\ b_3]$ using the routing scheme
described in Step 3 of Algorithm CNTR-BENES.


Because  any  even  component  can  be  2-labeled  in  2  different  ways,  there 
exist  at  least  2 12'  1  different  choices  of  eps.  This  implies  that  the  SBs  of 
SF1:(n SE\}n  has  many  different  settings  to  realize  a  given  permutation. 




Theorem V.1. The network $SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$ is rearrangeable.

Proof. It is first shown that all the switches at the last stage of the network
$SE_{1:n}$ (i.e., the omega network) are set straight if the last bit of every input equals
the last bit of its destination address. Let $i^m = i^m_1 i^m_2 \cdots i^m_n$ and
$d^r = d^r_1 d^r_2 \cdots d^r_n$ denote the label and destination address of the mth input,
respectively, where $0 \le m, r \le N-1$. Note that $i^m$ is the mth row of $I_{1:n}$
because the inputs are labeled from 0 to N-1 in ascending order. Lawrie [1]
has shown that, at any given stage k, $1 \le k \le n$, an input at the position
$i^m_k i^m_{k+1} \cdots i^m_n d^r_1 d^r_2 \cdots d^r_{k-1}$ is mapped by the perfect shuffle pattern to the
position $i^m_{k+1} i^m_{k+2} \cdots i^m_n d^r_1 d^r_2 \cdots d^r_{k-1} i^m_k$, and is then switched into the
position $i^m_{k+1} i^m_{k+2} \cdots i^m_n d^r_1 d^r_2 \cdots d^r_{k-1} d^r_k$ by the corresponding SB. Thus, at the
nth stage of $SE_{1:n}$ an input which has been switched to position
$i^m_n d^r_1 d^r_2 \cdots d^r_{n-1}$ is mapped by the perfect shuffle pattern to $d^r_1 d^r_2 \cdots d^r_{n-1} i^m_n$
and is then switched into position $d^r_1 d^r_2 \cdots d^r_{n-1} d^r_n$. So, if $i^m_n = d^r_n$, then the
switch to which $i^m$ is an input is set straight. Therefore, if $i^m_n = d^r_n$ for every
pair of input and its destination, then all the switches at the last stage are set
straight, i.e., at the last stage only the shuffle pattern is needed and all the
switches can be removed.
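
The stage-by-stage positions used in this argument can be traced with a short sketch. It follows a single input only and ignores conflicts with other inputs; the function name `omega_route` and the integer encoding of positions (bit 1 most significant) are assumptions made for illustration.

```python
def omega_route(src, dst, n):
    """Route one input through an n-stage omega network by destination tag.

    Returns the final position and the list of switch settings met at
    stages 1..n. Bit 1 of an address is its most significant bit.
    """
    mask = (1 << n) - 1
    pos = src
    settings = []
    for k in range(1, n + 1):
        # Perfect shuffle: rotate the n-bit position left by one bit.
        pos = ((pos << 1) | (pos >> (n - 1))) & mask
        d_k = (dst >> (n - k)) & 1
        # The switch leaves the input on its port (straight) when the low
        # bit already equals d_k, otherwise it moves it to the other port.
        settings.append("straight" if (pos & 1) == d_k else "cross")
        pos = (pos & ~1) | d_k
    return pos, settings
```

In this model the last entry of `settings` is "straight" exactly when the last bits of `src` and `dst` agree, which is the condition the proof uses to remove the last stage of switches.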

Given a balanced matrix $B_{1:n}$, there exists a balanced matrix $D_{1:n}$
determined by the repeated application of Lemma III.1 such that the matrices
$[I_{1:n}\ D_{1:n}]$ and $[B_{1:n}\ D_{1:n}]$ are balanced [2]. This means that a given permutation
represented by $B_{1:n}$ can be expressed as a product of the permutation $D_{1:n}$
realized on $SE_{1:n}$ (Proposition 5.1 [2]) and the permutation $D_{1:n} \to B_{1:n}$ realized on
$SE^{-1}_{1:n}$ (Lemma V.1). It is shown below that the last column of $D_{1:n}$ can always
be made equal to the last column of $I_{1:n}$, which implies that the last stage of
$SE_{1:n}$ can be replaced by $IP_s$ because it follows from the above paragraph that
in this case all the switches of this stage can always be set straight. Because
$[I_{1:n}\ D_{1:(n-1)}]$ is balanced, the matrix $[i_n\ D_{1:(n-1)}]$ is balanced. Because the
number of columns in $[i_n\ D_{1:(n-1)}]$ is not greater than n, $[D_{1:(n-1)}\ i_n]$ is also
balanced. So, the matrices $[I_{1:n}\ D_{1:(n-1)}\ i_n]$ and $[B_{1:n}\ D_{1:(n-1)}\ i_n]$ are balanced,
which implies that $d_n$ can be made equal to $i_n$. Because $[I_{1:n}\ D_{1:(n-1)}\ i_n]$ is
balanced, the permutation represented in binary by $[D_{1:(n-1)}\ i_n]$ passes $SE_{1:n}$.
Thus, the mth input of $SE_{1:n}$, denoted by $i^m$, is sent to the output $d^r$ whose
value equals the contents of the mth row of $[D_{1:(n-1)}\ i_n]$. When the mth row of
$[D_{1:(n-1)}\ i_n]$ is used as the routing tag for the mth input of $SE_{1:n}$, all the SBs of
$SE_n$ are set straight because $d^r_n = i^m_n$. This implies that the nth bits of the
routing tags are no longer needed. Therefore, when all the SBs of the last stage
of $SE_{1:n}$ are removed, the permutation corresponding to $[D_{1:(n-1)}\ i_n]$ can be
realized through $SE_{1:(n-1)} IP_s$ using only $D_{1:(n-1)}$ as a matrix of routing tags.
Because $[B_{1:n}\ D_{1:(n-1)}\ i_n]$ is balanced, it follows from Lemma V.1 that $SE^{-1}_{1:n}$
realizes the permutation represented by $[D_{1:(n-1)}\ i_n] \to B_{1:n}$. Therefore, the
permutation represented by $B_{1:n}$ is realized by the network $SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$,
that is, $SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$ is rearrangeable. □

The following algorithm determines the routing tags of the inputs of
$SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$ to realize any permutation.

Algorithm RSE

Input:   A balanced matrix $B_{1:n} = [b_1\ b_2 \cdots b_n]$ and the identity permutation
matrix $I_{1:n}$.

Output:  A matrix $U_{1:(2n-1)}$ such that, when its mth row, $0 \le m \le N-1$, is
used as the routing tag for the mth input of $SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$, the
permutation represented by $B_{1:n}$ is realized by $SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$.

Step 1.  Let p denote an integer variable. Set p = 1. Determine a column vector
$d_1$ such that the matrices $[I_{2:n}\ d_1]$ and $[B_{2:n}\ d_1]$ are balanced.

Step 2.  Increment p by 1. If $p > n-1$, go to the next step; otherwise, first
determine a column vector $d_p$ such that $[I_{(p+1):n}\ D_{1:p}]$ and
$[B_{(p+1):n}\ D_{1:p}]$ are balanced, and then go to Step 2.

Step 3.  Let $U_{1:(2n-1)} = [D_{1:(n-1)}\ B_{1:n}]$. For $1 \le j \le n-1$, a SB at the jth
stage of $SE_{1:(n-1)}$ examines the bit $u_j$ of the routing tag
$(u_1 u_2 \cdots u_{2n-1})$ of its inputs, while for $n \le j \le 2n-1$ a SB at
the jth stage of $SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$ examines the bit $u_{3n-j-1}$ of the
routing tag $(u_1 u_2 \cdots u_{2n-1})$ of its input. Stop.
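
Steps 1 to 3 can be sketched in Python as follows. This is an illustrative sketch, not the report's code: rows are stored as bit tuples (first column most significant), the helper names `to_bits`, `matching`, `two_label` and `rse_tags` are hypothetical, and the 2-labeling is done by a simple sequential walk in place of procedure COLUMN or COLUMN_PRAM.

```python
def to_bits(x, n):
    """n-bit tuple of x, most significant bit first."""
    return tuple((x >> (n - 1 - k)) & 1 for k in range(n))


def matching(rows):
    """Pair the rows of a balanced matrix that agree in all but the first column."""
    seen = {}
    mate = [None] * len(rows)
    for i, row in enumerate(rows):
        key = row[1:]
        if key in seen:
            j = seen.pop(key)
            mate[i], mate[j] = j, i
        else:
            seen[key] = i
    return mate


def two_label(r1, r2):
    """Label nodes 0/1 so that both matchings join differently labeled nodes."""
    V = [None] * len(r1)
    for p in range(len(r1)):
        m = p
        while V[m] is None:
            V[m] = 0
            V[r1[m]] = 1
            m = r2[r1[m]]
    return V


def rse_tags(perm, n):
    """Rows of U = [D_{1:(n-1)} B_{1:n}] for the permutation perm on 2**n items."""
    N = 1 << n
    B = [to_bits(perm[i], n) for i in range(N)]
    X = [to_bits(i, n) for i in range(N)]   # starts as I_{1:n}
    Y = list(B)                             # starts as B_{1:n}
    D = [() for _ in range(N)]
    for _ in range(n - 1):
        d_p = two_label(matching(X), matching(Y))
        # Drop the first column and append d_p, giving
        # [I_{(p+1):n} D_{1:p}] and [B_{(p+1):n} D_{1:p}].
        X = [X[i][1:] + (d_p[i],) for i in range(N)]
        Y = [Y[i][1:] + (d_p[i],) for i in range(N)]
        assert len(set(X)) == N and len(set(Y)) == N   # both stay balanced
        D = [D[i] + (d_p[i],) for i in range(N)]
    return [D[i] + B[i] for i in range(N)]
```

Since each cycle can be 2-labeled in two ways, the tags returned are one valid choice among several; the sketch only guarantees the balance conditions of Steps 1 and 2.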

To prove the correctness of Algorithm RSE, note that it follows from Algorithms
CONS-COLUMN and CONS_COLUMN_PRAM that Step 1 and Step 2
of Algorithm RSE are realizable. Theorem V.1 has shown that $B_{1:n}$ passes
$SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$. Now, it remains to show that the mth row of
$U_{1:(2n-1)} = [D_{1:(n-1)}\ B_{1:n}]$ can be used as the routing tag for the mth input of
$SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$ to realize $B_{1:n}$.

Let $U_{1:(2n-1)}(m)$ denote the mth row of $U_{1:(2n-1)}$. The destination tag
routing scheme used in $SE_{1:k}$, $k \ge 1$, (respectively, $SE^{-1}_{1:k}$) is as follows: at the
jth stage, $1 \le j \le k$, of $SE_{1:k}$ (respectively, $SE^{-1}_{1:k}$) a SB examines the bit $u_j$
(respectively, $u_{k-j+1}$) of the routing tag $(u_1 u_2 \cdots u_k)$ of its input [1]. If
$u_j = 0$, the input is sent to the upper output of the SB; otherwise, the lower
output of the SB is taken. Therefore, the routing scheme described in Step 3 is
used, and $U_{1:(2n-1)}$ is assigned $[D_{1:(n-1)}\ B_{1:n}]$. This completes the correctness
proof of Algorithm RSE.

Algorithms CNTR-BENES and RSE are the same, except that Algorithm
RSE replaces $R_{1:n}$ by $I_{1:n}$. Therefore, the complexities of these algorithms are
the same, that is, Algorithm RSE also takes $O(N \log N)$ time in serial and
$O(\log^2 N)$ time in parallel.

The  following  example  uses  figures  to  illustrate  Algorithm  RSE. 

Example V.1. Let N = 8 and $I_{8 \times 3} = [i_1\ i_2\ i_3]$. In this example, given a
permutation b = (0 3 2 1 6)(4 7)(5), the matrix $U_{1:5} = [D_{1:2}\ B_{1:3}]$ of routing tags
to realize the permutation on $SE_{1:2} IP_s SE^{-1}_{1:3}$ is determined. Columns of $D_{1:2}$
can be determined one by one from left to right using either procedure
COLUMN or procedure COLUMN_PRAM. Column $d_1$ must be such that $[I_{2:3}\ d_1]$ and
$[B_{2:3}\ d_1]$ are balanced. So, $d_1$ is determined by a 2-labeling of the union of the
perfect matching graphs of $I_{2:3}$ and $B_{2:3}$, as shown in Figure V.1. Similarly, $d_2$
is determined by a 2-labeling of the union of the perfect matching graphs of
$[i_3\ d_1]$ and $[b_3\ d_1]$, so that the matrices $[i_3\ d_1\ d_2]$ and $[b_3\ d_1\ d_2]$ are balanced, as
shown in Figure V.1. Using the routing scheme described in Step 3 of Algorithm
RSE, the switches of the $SE_{1:2} IP_s SE^{-1}_{1:3}$ network are set to realize the
permutation b, as shown in Figure V.2. End of Example.


[Figure V.1: panels (a) to (d) not reproduced.]

Figure V.1. (a) $PG_{I_{2:3} \cup B_{2:3}}$, the union of the perfect matching graphs of $I_{2:3}$ and $B_{2:3}$; the
solid edges (respectively, the dashed edges) form the perfect matching graph of
$I_{2:3}$ (respectively, $B_{2:3}$); (b) $d_1$ obtained from a 2-labeling of $PG_{I_{2:3} \cup B_{2:3}}$; (c)
$PG_{[i_3\ d_1] \cup [b_3\ d_1]}$, the union of the perfect matching graphs of $[i_3\ d_1]$ and $[b_3\ d_1]$;
the solid edges (respectively, the dashed edges) form the perfect matching graph
of $[i_3\ d_1]$ (respectively, $[b_3\ d_1]$); (d) $d_2$ obtained from a 2-labeling of
$PG_{[i_3\ d_1] \cup [b_3\ d_1]}$.

Because Algorithm RSE is the same as Algorithm CNTR-BENES except
that $R_{1:n}$ is replaced by $I_{1:n}$, the minimum number of settings that enables a
$B_{N \times n}$ to pass through $SE_{1:(n-1)} IP_s SE^{-1}_{1:n}$ is also equal to

p- 1 




VI. CONCLUSIONS

New routing algorithms are presented in this paper for realizing any
permutation in two rearrangeable networks, the Benes network and the reduced
$\Omega_N \Omega_N^{-1}$. The proposed routing algorithms are easier to comprehend, and they take
$O(N \log N)$ time in serial and $O(\log^2 N)$ time in parallel. These algorithms first
compute a routing tag of 2n-1 bits for every input, then set the switches one
stage at a time such that the switches at the jth stage are set by decoding the
jth bits of the pre-computed routing tags at their inputs. The last n bits of any
routing tag are the destination address bits. The routing algorithm of the
reduced $\Omega_N \Omega_N^{-1}$ follows from a new rearrangeability proof presented in this
paper.


[Figure V.2: network diagram not reproduced.]

Figure V.2. The $SE_{1:2} IP_s SE^{-1}_{1:3}$ network with 8 inputs/outputs; the switches are set
according to the routing tag matrix $U_{1:5} = [d_1\ d_2\ b_1\ b_2\ b_3]$ using the routing scheme
described in Step 3 of Algorithm RSE.


VII.  APPENDIX

Procedure CONS-LIST(N) constructs a linear linked list of N nodes, called
H, each of which contains two fields, LEFT and RIGHT, which are the
pointers to the previous and the following node, respectively. For $1 \le i \le N-1$,
the LEFT field of the $i$th node of H equals $i-1$. For $0 \le i \le N-2$, the RIGHT
field of the $i$th node of H equals $i+1$. Let p be the pointer to the header of H,
initialized to 0. Procedure DELETE(k) deletes the node H(k) from H. In
addition, it updates the pointer p if H(k) is the first node of the current H.




procedure CONS-LIST(N)
    Let p be the header of H and p := 0
    LEFT(0) := nil; RIGHT(0) := nil
    for s := 1 to N-1 do
        LEFT(s) := s-1; RIGHT(s) := nil; RIGHT(s-1) := s
    end
end CONS-LIST(N)

procedure DELETE(k)
    Let p be the header of H (initialized to 0 by CONS-LIST)
    if k = p then do p := RIGHT(p); LEFT(p) := nil; end
    else RIGHT(LEFT(k)) := RIGHT(k)
end DELETE(k)
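
The two procedures can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `IndexList`, the use of `None` for nil, and the helper `to_list` are introduced here. One deliberate difference from the pseudocode is also an assumption: when a non-header node is deleted, the LEFT field of its successor is updated as well, so that later deletions of neighbouring nodes see a consistent predecessor.

```python
from typing import List, Optional


class IndexList:
    """Doubly linked list over node indices 0..n-1, stored in two arrays."""

    def __init__(self, n: int) -> None:
        # CONS-LIST: LEFT(i) = i-1 for i >= 1, RIGHT(i) = i+1 for i <= n-2.
        self.left: List[Optional[int]] = [i - 1 if i > 0 else None for i in range(n)]
        self.right: List[Optional[int]] = [i + 1 if i < n - 1 else None for i in range(n)]
        self.head: Optional[int] = 0 if n > 0 else None  # the pointer p

    def delete(self, k: int) -> None:
        """DELETE(k); k must currently be in the list."""
        prev, nxt = self.left[k], self.right[k]
        if k == self.head:
            self.head = nxt            # the header moves to the next node
        else:
            self.right[prev] = nxt     # bypass k going forward
        if nxt is not None:
            self.left[nxt] = prev      # assumption: also bypass k going backward

    def to_list(self) -> List[int]:
        """Indices of the remaining nodes, from the header onward."""
        out, node = [], self.head
        while node is not None:
            out.append(node)
            node = self.right[node]
        return out


lst = IndexList(5)
lst.delete(0)   # deleting the header advances p
lst.delete(2)   # deleting an interior node
print(lst.head, lst.to_list())
```

Storing the links in arrays indexed by node number keeps both construction and each deletion free of searching, which matches the way the appendix addresses nodes as H(k).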

Proof of Theorem IV.2. ($\Leftarrow$) It is shown that if $D_{N \times k}$ fits $F^{s}_{N \times k}$ then
$[R_{N \times n}\ D_{N \times k}]$ is balanced. By definition, a matrix with $N$ rows and more than $n$
columns is balanced if every $n$ consecutive columns form a balanced matrix.
Therefore, it suffices to show that for any $j$, $1 \le j \le k$, $[R_{(j+1):n}\ D_{1:j}]$ is
balanced. Partition $[R_{(j+1):n}\ D_{1:j}]$ into $2^{n-j}$ submatrices
$S^{h}_{2^j \times n} = [R^{h}_{(j+1):n}\ D^{h}_{1:j}]$, $0 \le h \le 2^{n-j}-1$, such that each of
$S^{h}_{2^j \times n}$, $R^{h}_{(j+1):n}$, and $D^{h}_{1:j}$ contains $2^j$ rows
whose indices belong to the same block of the partition $P_j$, that is, the row
indices of $S^{h}_{2^j \times n}$, $R^{h}_{(j+1):n}$, and $D^{h}_{1:j}$ consist of the numbers $h \times 2^j$ to
$[(h+1) \times 2^j]-1$ inclusive. $D^{h}_{1:j}$ has $2^j$ distinct rows because it is a balanced
matrix of order $2^j \times j$. Therefore, no matter what the rows of $R^{h}_{(j+1):n}$ are, the
$2^j$ rows of $S^{h}$ are distinct. On the other hand, because of the definitions of
$R_{N \times n}$ and $P_j$, the rows of $R^{h}_{(j+1):n}$ are all identical and are distinct from any
other row of $R_{(j+1):n}$. It follows that there exist no common rows in any two
different submatrices $S^{x}_{2^j \times n}$ and $S^{y}_{2^j \times n}$, where $x \ne y$ and $0 \le x, y \le 2^{n-j}-1$.
Therefore, there are $2^{n-j}$ submatrices $S^{h}_{2^j \times n}$ that have no common rows and
each submatrix has $2^j$ distinct rows. So, the total number of distinct rows in
$[R_{(j+1):n}\ D_{1:j}]$ equals $2^n = 2^{n-j} \times 2^j$. Because $[R_{(j+1):n}\ D_{1:j}]$ is of order $N \times n$, by
Definition II.2.1 the matrix $[R_{(j+1):n}\ D_{1:j}]$ is balanced.

($\Rightarrow$) It is shown that if $[R_{N \times n}\ D_{N \times k}]$ is balanced then $D_{N \times k}$ fits
$F^{s}_{N \times k}$. Consider the balanced matrix $[R_{(j+1):n}\ D_{1:j}]$. By definition of
$R_{N \times n}$, the rows of $R^{h}_{(j+1):n}$ of order $2^j \times (n-j)$ (whose indices belong to the
same block in $P_j$) are identical. By hypothesis, $[R_{N \times n}\ D_{N \times k}]$ is balanced and,
therefore, $[R_{(j+1):n}\ D_{1:j}]$ must also be balanced. This implies that $D^{h}_{1:j}$ (whose
indices belong to the same block of $P_j$) must be a balanced matrix of order
$2^j \times j$ (because, as mentioned above, the rows in $R^{h}_{(j+1):n}$ are identical). This is
true for all $j = 1, \ldots, k$ and all blocks of $P_j$. Therefore, by the definition of fit
(Definition II.2.9) and the definition of standard frame $F^{s}_{N \times k}$ (Definition
II.2.7), $D_{N \times k}$ fits $F^{s}_{N \times k}$. $\Box$
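
The equivalence argued in this proof can be checked numerically on small cases. The sketch below is an illustration under stated assumptions, not code from the paper: all function names are introduced here, and `fits_standard_frame` encodes the block-wise condition used in the proof (every $D^{h}_{1:j}$ is balanced for $1 \le j \le k$) as an assumed reading of Definitions II.2.7 and II.2.9, which are not reproduced in this section. The matrix `D_ROWS` is transcribed from Figure IV.1(b).

```python
from typing import List

Matrix = List[List[int]]


def reverse_matrix(n: int) -> Matrix:
    """Row i holds the n bits of i, least significant bit first."""
    return [[(i >> c) & 1 for c in range(n)] for i in range(1 << n)]


def is_balanced(m: Matrix, n: int) -> bool:
    """True if m has 2**n rows and every window of n consecutive columns
    has 2**n distinct rows."""
    width = len(m[0])
    return len(m) == 1 << n and all(
        len({tuple(row[s:s + n]) for row in m}) == len(m)
        for s in range(width - n + 1)
    )


def fits_standard_frame(d: Matrix, k: int) -> bool:
    """Assumed reading of 'fits': for each j in 1..k, every block of 2**j
    consecutive rows is distinct on the first j columns."""
    for j in range(1, k + 1):
        size = 1 << j
        for start in range(0, len(d), size):
            if len({tuple(r[:j]) for r in d[start:start + size]}) != size:
                return False
    return True


def concat(a: Matrix, b: Matrix) -> Matrix:
    """Place the columns of b to the right of the columns of a."""
    return [ra + rb for ra, rb in zip(a, b)]


D_ROWS = ["0000", "1100", "0100", "1011", "1000", "0010", "0111", "1110",
          "0011", "1001", "1111", "0110", "0101", "1010", "0001", "1101"]
D = [[int(c) for c in row] for row in D_ROWS]
R = reverse_matrix(4)
print(fits_standard_frame(D, 4), is_balanced(concat(R, D), 4))
```

For any candidate matrix the two printed values should agree; a matrix whose first column is constant on rows 0 and 1, for example, fails both checks.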




Theorem IV.2 establishes a relation between the reverse permutation
matrix $R_{N \times n}$ and $F^{s}_{N \times k}$, $1 \le k \le n$. The basic idea of Theorem IV.2 is
illustrated next in an example.


Example IV.1. Let $N = 16$ and $n = 4$. The reverse permutation matrix
$R_{16 \times 4}$, a balanced matrix $D_{16 \times 4}$ that fits $F^{s}_{16 \times 4}$, and the matrix $[R_{16 \times 4}\ D_{16 \times 4}]$
with the frame $F^{s}_{16 \times 4}$ are illustrated in Figure IV.1. As it is seen from
$[R_{16 \times 4}\ D_{16 \times 4}]$, the rows of $R^{h}_{(j+1):n}$ are all identical and are distinct from any
other row of $R_{(j+1):n}$. Also, notice that $D^{h}_{1:j}$ is a balanced matrix of order $2^j \times j$,
so that each row in it is distinct. End of example.


Row    (a) R     (b) D     (c) [R  D]
  0    0000      0000      0000  0000
  1    1000      1100      1000  1100
  2    0100      0100      0100  0100
  3    1100      1011      1100  1011
  4    0010      1000      0010  1000
  5    1010      0010      1010  0010
  6    0110      0111      0110  0111
  7    1110      1110      1110  1110
  8    0001      0011      0001  0011
  9    1001      1001      1001  1001
 10    0101      1111      0101  1111
 11    1101      0110      1101  0110
 12    0011      0101      0011  0101
 13    1011      1010      1011  1010
 14    0111      0001      0111  0001
 15    1111      1101      1111  1101

Figure IV.1. (a) The reverse permutation matrix $R_{16 \times 4}$ for $N = 16$. (b) A $D_{16 \times 4}$ that
fits $F^{s}_{16 \times 4}$. (c) $[R_{16 \times 4}\ D_{16 \times 4}]$ with $F^{s}_{16 \times 4}$.


REFERENCES 

[1]  D.  H.  Lawrie,  “Access  and  alignment  of  data  in  an  array  processor,”  IEEE 
Trans,  on  Comput.,  vol.  c-24,  pp.  1145-1155,  Dec.  1975. 

[2]  N.  Linial  and  M.  Tarsi,  “Interpolation  between  bases  and  the  Shuffle- 
Exchange  network,”  Europ.  J.  Combinatorics,  vol.  10,  pp.  29-39,  1989. 

[3]  J.  A.  Bondy  and  U.  S.  R.  Murty,  Graph  Theory  with  Applications,  North- 
Holland,  New  York,  1979. 

[4]  L.  Lovasz  and  M.  D.  Plummer,  Matching  Theory,  Annals  of  Discrete 
Math.,  29,  North-Holland,  1986. 

[5]  H.  S.  Stone,  “Parallel  processing  with  the  perfect  shuffle,”  IEEE  Trans,  on 
Comput.,  vol.  c-20,  pp.  153-161,  Feb.  1971. 


[6]  V.  E.  Benes,  Mathematical  Theory  of  Connecting  Networks  and  Telephone 
Traffic,  Academic  Press,  New  York,  1965. 

(7 J  A.  Waksman,  “A  permutation  network,”  Journal  of  the  ACM,  vol.  15,  No. 
1,  pp.  159-163,  Jan.  1968. 

[8]  C.  P.  Kruskal  and  M.  Snir,  “A  unified  theory  of  interconnection  network 
structure,”  Theoretical  Computer  Science,  vol.  48,  pp.  75-94,  1986. 

[9]  K.  Y.  Lee,  “On  the  rearrangeability  of  2(log  N)-l  stage  permutation  net¬ 
works,”  IEEE  Trans.  Comput.,  vol.  c-34,  pp.  412-425,  May  1985. 

[lOj  C.  Wu  and  T.  Feng,  “On  a  class  of  multistage  interconnection  networks,” 
IEEE  Trans,  on  Comput.,  vol.  c-29,  pp.  696-702,  1980. 

[l  1  ]  T.  Lang,  “Interconnections  between  processors  and  memory  modules  using 
the  Shuffle  Exchange  network,"  IEEE  Trans,  on  Comput.,  vol.  c-25,  No  5, 
May  1976. 

[12]  T.  Etzion  and  A.  Lempel,  “An  efficient  algorithm  for  generating  linear 
transformations  in  a  shuffle-exchange  network,”  SIAM  J.  Comput.,  Vol.  15, 
No.  1,  pp.  216-221,  Feb.  1986. 

[13]  K.Y.  Lee,  “A  new  Benes  network  control  algorithm,”  IEEE  Trans,  on 
Comput.,  vol.  c-36,  pp.  768-772,  June  1987. 

[14]  D.  Nassimi  and  S.  Sahni,  “A  self-routing  Benes  network  and  parallel  per¬ 
mutation  algorithms,”  IEEE  Trans,  on  Comput.,  vol.  c-30,  pp.  332-340, 
May  1981. 

[15]  - ,  “Parallel  algorithms  to  set  up  the  Benes  permutation  network,” 

IEEE  Trans,  on  Comput.,  vol.  c-31,  pp.  148-154,  Feb.  1982. 

[16]  D.C.  Opferman  and  N.T.  Tsao-Wu,  “On  a  class  of  rearrangeable  switching 
networks,  Part  I:  Control  algorithm,”  Bell  System  Tech.  J.,  vol.  50,  pp. 
1579-1600,  1971. 

[17]  H.  Cam,  Design  and  permutation  routing  algorithms  of  rearrangeable  net¬ 
works,  Ph.D.  Thesis,  Purdue  Univ.,  May  1992. 

[18]  H.  Cam  and  J.A.B.  Fortes,  “Frames:  a  simple  characterization  of  permuta¬ 
tions  realized  by  frequently  used  networks,”  submitted  for  publication. 

[19]  J.J.  Rotman,  The  theory  of  groups;  An  introduction,  Allyn  &  Bacon,  Rock- 
leigh,  N.J.,  1965. 

{20]  H.J.  Siegel,  Interconnection  Networks  for  Large-Scale  Parallel  Processing, 
Lexington  Books,  1986. 

[21]  T.H.  Cormen,  C.E.  Leiserson  and  R.L.  Rivest,  Introduction  to  Algorithms, 
MIT  Press,  1991. 


